2
votes

The classic example of a histogram is: x = defined bins of some continuous variable, y = frequency of those bins occurring.

My situation:

I have a data set with one column as U.S. zip codes and other columns with various statistics about those zip codes (two of which are median_household_income and population).

I want to make a histogram-type plot where the x axis is bins of the variable median_household_income (in increments of, say, $10,000) and where the y axis is something other than the frequency of those bins, specifically the average population for each bin (i.e. the populations of all zips in, say, the $40,000-$60,000 bin would be averaged, and that average would be the height of the bar on the y axis).

The hist function and the histogram functions of ggplot2 don't seem to have an option for choosing what goes on the y axis. It simply defaults to frequency.

I have had some luck using plyr's ddply and ggplot2's geom_bar functions, which have allowed me to put population on the y axis using these two lines of code:

population = ddply(data, "median_household_income", summarise, population = mean(data$population))

ggplot(population, aes(x = factor(data$median_household_income), y = data$population)) + geom_bar(stat = "identity")

...but that doesn't allow me to designate bin sizes and thus group zip codes. It merely produces a separate bar for every zip code in my data set (which obviously makes it impossible to average populations for bins, since there aren't any bins in the first place).

Any help?

You can try using the cut function to cut the continuous variable household income into ranges (a factor variable) and plot those factors on the x axis. - Gopala
Without sample data to reproduce and work with, it is not possible to help you with working code. - Gopala
A histogram is defined as a bar chart with a continuous variable on the x-axis and frequency on the y. You're just talking about a bar chart, thus geom_bar. Set your bins for your x variable with cut (or Hmisc::cut2 for more flexibility). - alistaire

1 Answer

4
votes

Without a reproducible data set I can't demonstrate with your actual data, but I suspect you are looking to build a bar plot from scratch. You need to create a column of the requisite bins, probably using cut on whatever variable you want to bin, do the aggregation yourself, then plot the result.
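One detail worth checking before binning is how cut treats values that sit exactly on a break. The sketch below uses made-up incomes (not the asker's data) to compare the default right-closed intervals with left-closed ones; the variable names are only illustrative.

```r
# Hypothetical incomes, including values that sit exactly on bin edges
income <- c(40000, 45000, 50000, 59999)
breaks <- seq(40000, 60000, by = 10000)

# Default: intervals are right-closed, (a, b], so 40000 falls outside
# the first bin and becomes NA
default_bins <- cut(income, breaks = breaks)

# right = FALSE gives left-closed intervals, [a, b);
# dig.lab = 6 keeps the labels out of scientific notation
left_closed_bins <- cut(income, breaks = breaks, right = FALSE, dig.lab = 6)

print(data.frame(income, default_bins, left_closed_bins))
```

Which convention is appropriate depends on how the brackets are meant to be read; the point is only that the choice changes which bin an edge value lands in.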

Here's a toy example, aggregating by income bracket and then calculating the average population and median income within each bracket:

data <- data.frame(population=c(10,20,14,12,32)*1e5, 
                   income=c(3.1,2.2,1.3,4.1,1.1)*1e5)
data$bins <- cut(data$income, breaks=c(0,1e5,2e5,3e5,4e5,5e5))

library(dplyr) ## I prefer dplyr myself; if you also use plyr, load plyr first, since both define summarise
data2 <- data %>% 
  group_by(bins) %>% 
  summarise(avgpop=mean(population), medinc=median(income))

Plotting this as a bar plot of average population (by bin) against the bins, and adding the median income for each bin as text:

library(ggplot2)
ggplot(data2, aes(x=bins, y=avgpop, group=bins, fill=factor(bins))) + 
  geom_bar(stat="identity") + 
  geom_text(aes(x=bins, y=1e5, label=paste0("$",medinc)))

bar plot of average population by income bracket
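For the column names mentioned in the question, the same idea might look roughly like the following in base R. This is only a sketch: the zips data frame is a made-up stand-in for the real zip-code data, and the $10,000 bin width is taken from the question.

```r
# Hypothetical stand-in for the zip-code data described in the question
zips <- data.frame(
  median_household_income = c(42000, 47500, 51000, 58000, 63000),
  population              = c(12000, 8000, 20000, 10000, 5000)
)

# $10,000-wide breaks that cover the observed income range;
# the upper break is always strictly above the maximum income
lower <- floor(min(zips$median_household_income) / 1e4) * 1e4
upper <- (floor(max(zips$median_household_income) / 1e4) + 1) * 1e4
zips$income_bin <- cut(zips$median_household_income,
                       breaks = seq(lower, upper, by = 1e4),
                       right = FALSE, dig.lab = 6)

# Mean population within each income bin
avg_pop <- aggregate(population ~ income_bin, data = zips, FUN = mean)
print(avg_pop)
```

The resulting avg_pop data frame has one row per non-empty bin and can be handed to ggplot with geom_bar(stat="identity") in the same way as data2 above.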

EDIT: If you want to preserve all of the binned groups, not only the ones that contain some data (dplyr::summarise silently drops the empty ones), you could merge the summarised data back into the full list of bins:

data <- data.frame(population=c(10,20,14,12,32)*1e5, 
                   income=c(3.1,3.2,1.3,4.1,1.1)*1e5) ## modified for empty bin
## recompute the bins, since data was rebuilt above
data$bins <- cut(data$income, breaks=c(0,1e5,2e5,3e5,4e5,5e5))
data2 <- data %>% 
  group_by(bins) %>% 
  summarise(avgpop=mean(population), medinc=median(income))

data3 <- data2 %>% 
  merge(data.frame(bins=levels(data$bins)), by="bins", all=TRUE) %>%
  replace(is.na(.), 0)

ggplot(data3, aes(x=bins, y=avgpop, group=bins, fill=factor(bins))) + 
  geom_bar(stat="identity") + 
  geom_text(aes(x=bins, y=1e5, label=paste0("$",medinc)))

bar plot of average population by income bracket, empty bins preserved
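If you would rather avoid the merge, base R's tapply is one alternative, since it returns an entry for every factor level and leaves empty bins as NA. The sketch below rebuilds the same toy data; the name data3_base is just illustrative, and it only covers the average population, not the median income labels.

```r
# Same toy data as in the EDIT above, with bins recomputed
data <- data.frame(population = c(10, 20, 14, 12, 32) * 1e5,
                   income     = c(3.1, 3.2, 1.3, 4.1, 1.1) * 1e5)
data$bins <- cut(data$income, breaks = c(0, 1e5, 2e5, 3e5, 4e5, 5e5))

# tapply returns one entry per factor level; bins with no rows come back as NA
avgpop <- tapply(data$population, data$bins, mean)

# Rebuild a data frame, keeping the original bin order as factor levels
data3_base <- data.frame(
  bins   = factor(names(avgpop), levels = levels(data$bins)),
  avgpop = as.numeric(avgpop)
)

# Mirror the replace(is.na(.), 0) step so empty bins plot as zero-height bars
data3_base$avgpop[is.na(data3_base$avgpop)] <- 0
print(data3_base)
```

Whether an empty bin should be drawn as a zero-height bar or left out entirely is a presentation choice; leaving the NA in place instead of replacing it makes ggplot skip that bar with a warning.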