Display ranges and sum until the condition is met

Question

I have an excel file which has two columns, the first one is for customers and the second is the revenue generated from them. Let's say that my total revenue is 1000. I need to divide this total revenue in 5 buckets, i.e, 20% of total revenue(0-200), 40% of total revenue(200-400), 60% of total revenue(400-600), 80% of total revenue(600-800) and 100% of total revenue(800-1000). I want to count the number of customers within each bucket ranges, for example, how many customers lie in the range where sum of revenue<20% of total revenue and so on for other ranges and finally plot them using a bar graph. How can I do this in R? Below is a sample data:

 Customer   Revenue
    a          230
    b          170
    c          809
    d          435
    e          678
    f          350
    g          465
    h          990
    i          767
    j          500

Since it is not clear from your example. Do you want to have your customers sorted by their revenue, hence the customers with the least revenue are in the group of total revenue <20%? And why do you need R for this if it is an Excel File. Excel has a sum function and a cumulative sum function too — Ben373

RandowMalk RandowMalk · Accepted Answer · 2019-02-11T16:27:43

Original data:

df <- tibble(Customer = letters[1:10], Revenue = c(230, 170, 809, 435, 678, 350, 465, 990, 767, 500))

library(dplyr)
library(ggplot2)

Order the data.frame by ascending Revenue

df <- df %>% 
  arrange(Revenue)

Use the cut() function from R base to add a variable that displays in which of the 5 bins the cumulated Revenue is located. Then plot using the x-axis for the bins, and y-axis for the number of values for that categorical variable.

    df %>% 
      mutate(Revenue_Cumulated = cumsum(Revenue)/sum(Revenue)) %>% 
      mutate(bins = cut(Revenue_Cumulated, breaks = seq(0, 1, 0.2))) %>% 
      group_by(bins) %>% 
      summarise(n = n()) %>% 
      mutate(cumulated_n = cumsum(n)) %>% 

    # data.frame at that point in the code:
    # A tibble: 5 x 3
    #  bins          n cumulated_n
    #  <fct>     <int>       <int>
    # 1 (0,0.2]       3           3
    # 2 (0.2,0.4]     3           6
    # 3 (0.4,0.6]     1           7
    # 4 (0.6,0.8]     1           8
    # 5 (0.8,1]       2          10

    gather(key, value, -bins) %>% 
    ggplot(aes(x = bins, y = value, fill = key)) +
    geom_col(position = "dodge")+
    geom_text(aes(label=value),position=position_dodge(width=0.9),vjust=-0.25)

the cumulated_n now will tell how many customers contribute to the 0-X percentages. The gather function is there to transform the data in a longer format, which makes it easiere to treat "n" and "cumulated_n" as factors to highlight the difference in the graph.

Number_customers_by_bin

Display ranges and sum until the condition is met

2 Answers