3
votes

I would like to pass the length of my group_by variable to summarize.

Example data

set.seed(112)
df <- data.frame(
  groupper = factor(sample.int(n = 12, size = 100, replace = TRUE)),
  var = runif(100, min = 1, max = 25)
)

Now each factor level has a different number of observations:

table(df[,1])
1  2  3  4  5  6  7  8  9 10 11 12 
8  7  4  8  9  7 10  7 11  3 13 13 

Now I would like to find, for each level of groupper, the share of var values that fall into certain intervals.

My code looks like this:

results <- df %>% group_by(groupper) %>% summarise(
var0_25 = sum(var < 25 / length(groupper)), 
var25_50 = sum(var >= 25 & var < 50) / length(groupper))
#etc...
)

But, how in the world do I get the correct group_by(groupper) length into my summarize? It changes for each factor.

3
aaaaaahhhhh, thanks! (Thorst)

3 Answers

4
votes

We can use n() to get the number of rows in each group:

library(dplyr)
df %>%
    group_by(groupper) %>%
    summarise(var0_25 = sum(var < 25) / n(),
              var25_50 = sum(var >= 25 & var < 50) / n())
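
As a side note, a logical vector is treated as 0/1 in arithmetic, so sum(cond) / n() should give the same value as mean(cond). A minimal sketch on a small made-up data frame (toy, g and x are hypothetical names, not from the question):

```r
library(dplyr)

# Hypothetical toy data: group "a" has 3 rows, group "b" has 2 rows
toy <- data.frame(
  g = c("a", "a", "a", "b", "b"),
  x = c(10, 30, 40, 5, 20)
)

shares <- toy %>%
  group_by(g) %>%
  summarise(
    n_rows     = n(),               # size of the current group
    share_sum  = sum(x < 25) / n(), # count of matches divided by group size
    share_mean = mean(x < 25)       # same share, written as a mean
  )

shares
```

Both share columns hold the same numbers; which form to use is mostly a matter of taste.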
3
votes

I think a general solution when you want to calculate shares per interval is to use cut. This code is a bit longer, but it works for any number of intervals: you only adjust the breaks passed to cut. It also saves you from manually writing column names and equations.

library(dplyr)
library(tidyr)
df %>%
  mutate(indx = cut(var, c(1, 25, 50), right = FALSE)) %>%
  group_by(groupper) %>%
  mutate(Count = n()) %>%
  group_by(groupper, indx) %>%
  summarise(Res = n()/Count[1L]) %>%
  spread(indx, Res)

# Source: local data frame [12 x 3]
# 
#    groupper    [1,25)   [25,50)
# 1         1 0.5000000 0.5000000
# 2         2 0.8571429 0.1428571
# 3         3 0.7500000 0.2500000
# 4         4 0.3750000 0.6250000
# 5         5 0.2222222 0.7777778
# 6         6 0.5714286 0.4285714
# 7         7 0.4000000 0.6000000
# 8         8 0.4285714 0.5714286
# 9         9 0.3636364 0.6363636
# 10       10 0.3333333 0.6666667
# 11       11 0.6153846 0.3846154
# 12       12 0.3076923 0.6923077
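
In newer tidyr releases spread() is superseded by pivot_wider(), so the same idea could be sketched as below. This is only an illustration on made-up data (toy, g, x, bin and share are hypothetical names), and it assumes a tidyr version that accepts a scalar for values_fill:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: group "b" has no values in [25, 50)
toy <- data.frame(
  g = c("a", "a", "a", "b", "b"),
  x = c(10, 30, 40, 5, 20)
)

res <- toy %>%
  mutate(bin = cut(x, breaks = c(1, 25, 50), right = FALSE)) %>% # assign interval
  count(g, bin) %>%                  # rows per group and interval
  group_by(g) %>%
  mutate(share = n / sum(n)) %>%     # divide by the group total
  ungroup() %>%
  select(-n) %>%
  pivot_wider(names_from = bin, values_from = share,
              values_fill = 0)       # intervals absent from a group become 0

res
```

Filling missing combinations with 0 rather than NA is a design choice here; drop values_fill to keep them as NA.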
1
votes

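length(.) also works. The problem with your code was a misplaced closing parenthesis in var0_25: sum(var < 25 / length(groupper)) compares var with 25 / length(groupper) instead of dividing the count by the group size. With the brackets fixed: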

df %>% group_by(groupper) %>% 
    summarize(r = sum(var < 25) / length(groupper), 
              s = sum(var < 25), 
              l = length(groupper)) %>% 
    mutate(r2 = s / l)

# Source: local data frame [12 x 5]

#    groupper r  s  l r2
# 1         1 1  8  8  1
# 2         2 1  7  7  1
# 3         3 1  4  4  1
# 4         4 1  8  8  1
# 5         5 1  9  9  1
# 6         6 1  7  7  1
# 7         7 1 10 10  1
# 8         8 1  7  7  1
# 9         9 1 11 11  1
# 10       10 1  3  3  1
# 11       11 1 13 13  1
# 12       12 1 13 13  1

I added the columns s (for sum) and l (for length) just to show that the results are indeed correct.
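
To see why the bracket placement matters, here is a small sketch in base R with made-up numbers (x is a hypothetical vector standing in for one group). In R, / binds tighter than <, so the misplaced version compares against a shrunken threshold:

```r
# Hypothetical values: one group of 4 observations
x <- c(3, 10, 20, 30)

# Misplaced bracket: parsed as x < (25 / length(x)), i.e. x < 6.25,
# so this counts the values below 6.25
misplaced <- sum(x < 25 / length(x))

# Intended: count the values below 25, then divide by the group size
intended <- sum(x < 25) / length(x)

misplaced
intended
```

The first expression returns a count and the second a share, which is why the original var0_25 column did not look like a proportion.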