2
votes

R group_by and count distinct values in dataframe column with condition, using mutate

This is similar to the question R group by | count distinct values grouping by another column but slightly different in that i want to mutate together with a condition. to count distinct values in each group, but also present in the vector c

id = c(  1,  1,  2,  2,  3,  4,  4,  5,  5,  5,  6,  6,  6,  6 )
val = c(100,100,200,300,400,500,500,500,600,600,200,200,300,500)
df <- data.frame(id=id, val=val)
c= c(200,500)

I know the following code is wrong, but it gives and idea of what i am trying to do. So, I want to create a fourth column, by grouping df by id, where the count of distinct values is given when it is present in c

x <- df %>% group_by(id) %>% 
  mutate(distinctValues = n_distinct(val %in% c ))

If we consider the id = 6, The distict values present in it are 200,300 and 500. But 300 is not present in c. So the count is = 2.

so the output column will be c(0,0,1,1,0,1,1,1,1,1,2,2,2,2)

2

2 Answers

2
votes

Since c is unique, you can approach it from the other way - count the number of c values that show up in val.

df %>% 
  group_by(id) %>% 
  mutate(distinctValues = sum(c %in% val))
# # A tibble: 14 x 3
# # Groups:   id [6]
#       id   val distinctValues
#    <dbl> <dbl>          <int>
#  1     1   100              0
#  2     1   100              0
#  3     2   200              1
#  4     2   300              1
#  5     3   400              0
#  6     4   500              1
#  7     4   500              1
#  8     5   500              1
#  9     5   600              1
# 10     5   600              1
# 11     6   200              2
# 12     6   200              2
# 13     6   300              2
# 14     6   500              2

You could also use distinctValues = sum(unique(val) %in% c) if that seems clearer - it might be a tad less efficient, but not enough to matter unless your data is massive.

1
votes

With n_distinct, we need to apply on the subset of rows

library(dplyr)
df %>% 
    group_by(id) %>% 
    mutate(distinctValues = n_distinct(val[val %in% c]))

-output

# A tibble: 14 x 3
# Groups:   id [6]
#      id   val distinctValues
#   <dbl> <dbl>          <int>
# 1     1   100              0
# 2     1   100              0
# 3     2   200              1
# 4     2   300              1
# 5     3   400              0
# 6     4   500              1
# 7     4   500              1
# 8     5   500              1
# 9     5   600              1
#10     5   600              1
#11     6   200              2
#12     6   200              2
#13     6   300              2
#14     6   500              2