2
votes

enter image description here

I would like some ideas on how to approach this interesting problem (to me at least). Let's say that I have a population with 3 different feature variables and some quantitative ratings with the population. An example is like the following:

df

   income expense education gender residence
1   153      2989 NoCollege      F       Own
2   289       872   College      F      Rent
3   551        98 NoCollege      M      Rent
4   286       320   College      M      Rent
5   259       372 NoCollege      M      Rent
6   631       221 NoCollege      M       Own
7   729       105   College      M      Rent
8   582       450 NoCollege      M       Own
9   570       253   College      F      Rent
10 1380       635 NoCollege      F      Rent
11  409       425 NoCollege      M      Rent
12  569       232 NoCollege      F       Own
13  317       856   College      M      Rent
14  199       283   College      F       Own
15  624       564 NoCollege      M       Own
16 1064       504 NoCollege      M       Own
17  821       169 NoCollege      F      Rent
18  402       175   College      M       Own
19  602       285   College      M      Rent
20  433       264   College      M      Rent
21  670       985 NoCollege      F       Own

I can do a calculation of spending-to-income ratio (SIR) over the segments defined by the 3 feature variables: education, gender and residence. So at the first level, no segmentation is done and the SIR is:

df %>% summarise(count=n(), spending_ratio=sum(expense)/sum(income)*100)
>>   count spending_ratio
   1    21           95.8

Then I break up the population into male and female groups, to get:

df %>% group_by(gender) %>% summarise(count=n(), spending_ratio=sum(expense)/sum(income)*100) 
>>   gender count spending_ratio
   1      F     8          138.0
   2      M    13           67.3

We continue this process by introducing education:

df %>% group_by(gender, education) %>% summarise(count=n(), spending_ratio=sum(expense)/sum(income)*100)
>>   gender education count spending_ratio
   1      F   College     3          133.1
   2      F NoCollege     5          139.4
   3      M   College     6           72.4
   4      M NoCollege     7           63.9

and finally adding residence:

df %>% group_by(gender, education, residence) %>% summarise(count=n(), spending_ratio=sum(expense)/sum(income)*100)
>>  gender education residence count spending_ratio
  1      F   College       Own     1          142.2
  2      F   College      Rent     2          131.0
  3      F NoCollege       Own     3          302.2
  4      F NoCollege      Rent     2           36.5
  5      M   College       Own     1           43.5
  6      M   College      Rent     5           77.3
  7      M NoCollege       Own     4           59.9
  8      M NoCollege      Rent     3           73.4

What I would like to achieve is to generate a treemap-like plot with all the above information included. But as you can see, the treemap plot is ways away from I want. What I want to get is a map that is similar to the image at the top, where the size of each rectangle represents the count and the color represent the SIR and all the levels of the tree are included.

Any help is deeply appreciated.

1

1 Answers

3
votes

You can use the treemap package to do the aggregation at the different levels, but then the output needs to be formatted quite a bit. When treemap does the successive aggregation, it drops all additional variables in the data.table. So, since your aggregation function requires additional variables, I created some dummy variables. The variable 'index' is used to index into 'expense' and 'income' from each subset. Here is how you could do it,

library(treemap)
library(data.table)

## Some dummy variables to aggregate by: ALL, i, and index
dat <- as.data.table(df)[, `:=`(total = factor("ALL"), i = 1, index = 1:.N)][]
indexList <- c('total', 'gender', 'education', 'residence')  # order or aggregation

## Function to aggregate at each grouping level (SIR)
agg <- function(index, ...) {
    dots <- list(...)
    expense <- dots[["expense"]][index]
    income <- dots[["income"]][index]
    sum(expense) / sum(income) * 100
}

## Get treemap data
res <- treemap(dat, index=indexList, vSize='i', vColor='index',
               type="value", fun.aggregate = "agg",
               palette = 'RdYlBu',
               income=dat[["income"]],
               expense=dat[["expense"]])  # ... args get passed to fun.aggregate

## The useful variables: level (corresponds to indexList), vSize (bar size), vColor(SIR)
## Create a label variable that is the value of the variable in indexList at each level
out <- res$tm
out$label <- out[cbind(1:nrow(out), out$level)]
out$label <- with(out, ifelse(level==4, substring(label, 1, 1), label))  # shorten labels
out$level <- factor(out$level, levels=sort(unique(out$level), TRUE))     # factor levels

## Time to find label positions, scale to [0, 1] first
## x-value is cumsum by group,  y will just be the level
out$xlab <- out$vSize / max(aggregate(vSize ~ level, data=out, sum)$vSize)
split(out$xlab, out$level) <- lapply(split(out$xlab, out$level), function(x) cumsum(x) - x/2)

## Make plot
library(ggplot2)
ggplot(out, aes(x=level, y=vSize, fill=color, group=interaction(level, label))) +
  geom_bar(stat='identity', position='fill') +  # add another for black rectangles but not legend
  geom_bar(stat='identity', position='fill', color="black", show_guide=FALSE) +
  geom_text(data=out, aes(x=level, y=xlab, label=label, ymin=0, ymax=1), size=6, font=2,
            inherit.aes=FALSE) +
  coord_flip() +
  scale_fill_discrete('SIR', breaks=out$color, labels = round(out$vColor)) +
  theme_minimal() +  # Then just some formatting 
  xlab("") + ylab("") +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

enter image description here

Edit

I think it actually works much better with gradient colors for SIR. To do that you just replace fill=color with fill=vColor and scale fill by some gradient.

## Make plot with gradient color for SIR
library(ggplot2)
ggplot(out, aes(x=level, y=vSize, fill=vColor, group=interaction(level, label))) +
  geom_bar(stat='identity', position='fill') +  # add another for black rectangles but not legend
  geom_bar(stat='identity', position='fill', color="black", show_guide=FALSE) +
  geom_text(data=out, aes(x=level, y=xlab, label=label, ymin=0, ymax=1), size=6, font=2,
            inherit.aes=FALSE) +
  coord_flip() +
  scale_fill_gradientn(colours = c("white", "red")) +
  theme_minimal() +  # Then just some formatting 
  xlab("") + ylab("") +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

enter image description here