How to tell R the number of observations for a category

Question

I am making a graph from a table of data from a paper. It has a column of relationship categories, followed by two numerical columns: the number of observations for each category and the IQ correlation:

relation       num  corr
spouse        3817  0.33
MZ-twin-tog   4671  0.86
MZ-twin-ap      65  0.72
DZ-twin-tog   5546  0.6
sib-tog      26473  0.47
sib-ap         203  0.24
off-par       8433  0.42
off-midpar     992  0.5
off-par-ap     814  0.22

I want to make a boxplot of (corr ~ relation), but I want the widths to be proportional to the number of observations for each category. Unfortunately, varwidth = TRUE won't work because I effectively have just one observation per category, since I'm not working with the full data set.

Does anyone know how to work with this, given that I don't have the complete data, just the results?

P.S. I know boxplot is not exactly an appropriate graph for this limited data set, but I don't know how else to display (numerical ~ categorical). Suggestions are welcome!

Thank you in advance for any advice!

You can't make a box plot with simple summary data like this. The sizes of the regions of a box plot are defined by the minimum, maximum, median, and first/third quartiles of your data. Box plots are intended to show the distribution of your data. What are you trying to illustrate by visualizing this summary? — Mako212
Thanks for the response. I wanted to visualize which groups have a higher correlation, while including something that illustrates which groups have a higher n (since the group sizes vary a lot). Looking at your bar graph below, I think that is exactly what I'm looking for only with my variable "corr" on the y axis and my variable "num" in the heat map on the right. I'm sure I can alter the code you provided to make the switch. Thank you so much, I really appreciate that. — Seth Watt

Mako212 · Accepted Answer · 2017-09-18T04:40:29

Data:

df1 <- structure(list(relation = structure(c(9L, 3L, 2L, 1L, 8L, 7L, 
5L, 4L, 6L), .Label = c("DZ-twin-tog", "MZ-twin-ap", "MZ-twin-tog", 
"off-midpar", "off-par", "off-par-ap", "sib-ap", "sib-tog", "spouse"
), class = "factor"), num = c(3817L, 4671L, 65L, 5546L, 26473L, 
203L, 8433L, 992L, 814L), corr = c(0.33, 0.86, 0.72, 0.6, 0.47, 
0.24, 0.42, 0.5, 0.22), num_pct = c(0.0748225977182734, 0.0915631003254009, 
0.00127416003450033, 0.108715254635982, 0.518935978358882, 0.00397929980005489, 
0.165307562629866, 0.019445642372682, 0.015956404124358)), .Names = c("relation", 
"num", "corr", "num_pct"), row.names = c(NA, -9L), class = "data.frame")

Consider a bar plot like this (I mapped corr to color on both plots):

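library(ggplot2)

# Bar height shows the number of observations; fill color shows the correlation.
g1 <- ggplot(df1, aes(relation, num)) +
  geom_bar(aes(fill = corr), stat = "identity") +
  theme_bw()
g1  # print the plot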

Or plot each category's share of the total number of observations.

First calculate the proportions:

df1$num_pct <- df1$num/sum(df1$num)

Then plot:

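# Same plot, but the y axis shows each category's share, labelled as a percent.
g2 <- ggplot(df1, aes(relation, num_pct)) +
  geom_bar(aes(fill = corr), stat = "identity") +
  scale_y_continuous(labels = scales::percent) +
  theme_bw()
g2  # print the plot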
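
If you still want the bar widths themselves to reflect the number of observations, as in the original question, base R's barplot() accepts a vector of widths. The following is only a rough sketch, not part of the plots above: it puts corr on the y axis and scales each bar's width by num. The square-root scaling and the name bar_width are my own assumptions, chosen so that the smallest groups (65 and 203 observations) stay visible next to the largest (26473).

```r
# Summary data copied from the question (one row per relationship category).
df1 <- data.frame(
  relation = c("spouse", "MZ-twin-tog", "MZ-twin-ap", "DZ-twin-tog", "sib-tog",
               "sib-ap", "off-par", "off-midpar", "off-par-ap"),
  num  = c(3817, 4671, 65, 5546, 26473, 203, 8433, 992, 814),
  corr = c(0.33, 0.86, 0.72, 0.6, 0.47, 0.24, 0.42, 0.5, 0.22)
)

# Assumption: square-root scaling keeps the smallest groups visible.
# Use bar_width <- df1$num for widths strictly proportional to n.
bar_width <- sqrt(df1$num)

# barplot() returns the x positions of the bar midpoints.
mids <- barplot(
  height    = df1$corr,
  width     = bar_width,
  names.arg = df1$relation,
  las       = 2,     # draw axis labels perpendicular to the axis
  cex.names = 0.7,   # shrink category labels so they fit
  ylim      = c(0, 1),
  ylab      = "IQ correlation",
  main      = "Bar width scaled by number of observations"
)
```

With strictly proportional widths, the MZ-twin-ap and sib-ap bars become nearly invisible, which is why the sketch uses a square root. If you do that, say so in the caption so readers don't read the widths as raw counts.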
