2
votes

Suppose I have a data frame:

df <- data.frame(SID=sample(1:4,15,replace=T), Var1=c(rep("A",5),rep("B",5),rep("C",5)), Var2=sample(2:4,15,replace=T))

which comes out to something like this:

    SID Var1 Var2
1     4    A    2
2     3    A    2
3     4    A    3
4     3    A    3
5     1    A    4
6     1    B    2
7     3    B    2
8     4    B    4
9     4    B    4
10    3    B    2
11    2    C    2
12    2    C    2
13    4    C    4
14    2    C    4
15    3    C    3

What I hope to accomplish is to find the count of unique SIDs (see below under update, this should have said count of unique (SID, Var1) combinations) where the given row's Var1 is excluded from this count and the count is grouped on Var2. So for the example above, I would like to output:

    SID Var1 Var2 Count.Excluding.Var1
1     4    A    2                    3
2     3    A    2                    3
3     4    A    3                    1
4     3    A    3                    1
5     1    A    4                    3
6     1    B    2                    3
7     3    B    2                    3
8     4    B    4                    3
9     4    B    4                    3
10    3    B    2                    3
11    2    C    2                    4
12    2    C    2                    4
13    4    C    4                    2
14    2    C    4                    2
15    3    C    3                    2

For the 1st observation, we have a count of 3 because there are 3 unique combinations of (SID, Var1) for the given Var2 value (2, in this case) where Var1 != A (Var1 value of 1st observation) -- specifically, the count includes observation 6, 7 and 11, but not 12 because we already accounted for a (SID, Var1)=(2,C) and not row 2 because we do not want Var1 to be "A". All of these rows have the same Var2 value.

I'd preferably like to use dplyr functions and the %>% operator. &

UPDATE

I apologize for the confusion and my incorrect explanation above. I have corrected what I intended on asking for in the paranthesis, but I am leaving my original phrasing as well because majority of answers seem to interpret it this way.

As for the example, I apologize for not setting the seed. There seems to have been some confusion with regards to the Count.Excluding.Var1 for rows 11 and 12. With unique (SID, Var1) combinations, rows 11 and 12 should make sense as these count rows 1,2,6, and 7 xor 8.

5
What is use of Var2 in above logic?MKR
this is a generalized example I've made for a binary naive bayes problem where Var2 is a feature for a class denoted by Var1, and SIDs resemble observations. So what I am actually trying to do is find all unique observations containing Var2 value that is not of Var1 class.shotes
As @MKR mentioned, the logic is not clear. Please check if the expected output is the correct oneakrun
Line 11 and 12 are correct ? I think it should be 3 for Count.Excluding.Var1DJack
Why count is 1 for 3rd row?MKR

5 Answers

2
votes

A simple mapply can do the trick. But as OP requested for %>% based solution, an option could be as:

df %>% mutate(Count.Excluding.Var1 = 
  mapply(function(x,y)nrow(unique(df[df$Var1 != x & df$Var2 == y,1:2])),.$Var1,.$Var2))
#     SID Var1 Var2 Count.Excluding.Var1
# 1    4    A    2                    3
# 2    2    A    3                    3
# 3    4    A    4                    3
# 4    4    A    4                    3
# 5    3    A    4                    3
# 6    4    B    3                    1
# 7    3    B    3                    1
# 8    3    B    3                    1
# 9    4    B    2                    3
# 10   2    B    3                    1
# 11   2    C    2                    2
# 12   4    C    4                    2
# 13   1    C    4                    2
# 14   1    C    2                    2
# 15   3    C    4                    2

Data:

The above results are based on origional data provided by OP.

df <- data.frame(SID=sample(1:4,15,replace=T), Var1=c(rep("A",5),rep("B",5),rep("C",5)), Var2=sample(2:4,15,replace=T))
0
votes

could not think of a dplyr solution, but here's one with apply

df$Count <- apply(df, 1, function(x) length(unique(df$SID[(df$Var1 != x['Var1']) & (df$Var2 == x['Var2'])])))
#     SID Var1 Var2 Count
# 1    4    A    2     3
# 2    3    A    2     3
# 3    4    A    3     1
# 4    3    A    3     1
# 5    1    A    4     2
# 6    1    B    2     3
# 7    3    B    2     3
# 8    4    B    4     3
# 9    4    B    4     3
# 10   3    B    2     3
# 11   2    C    2     3
# 12   2    C    2     3
# 13   4    C    4     2
# 14   2    C    4     2
# 15   3    C    3     2
0
votes

Here is a dplyr solution, as requested. For future reference, please use set.seed so we can reproduce your desired output with sample, else I have to enter data by hand...

I think this is your logic? You want the n_distinct(SID) for each Var2, but for each row, you want to exclude rows which have the same Var1 as the current row. So a key observation here is row 3, where a simple grouped summarise would yield a count of 2. Of the rows with Var2 = 3, row 3 has SID = 4, row 4 has SID = 3, row 15 has SID = 3, but we don't count row 3 or row 4, so final count is one unique SID.

Here we get first the count of unique SID for each Var2, then the count of unique SID for each Var1, Var2 combo. First count is too large by the amount of additional unique SID for each combo, so we subtract it and add one. There is an edge case where for a Var1, there is only one corresponding Var2. This should return 0 since you exclude all the possible values of SID. I added two rows to illustrate this.

library(tidyverse)
df <- read_table2(
  "SID Var1 Var2
4    A    2
3    A    2
4    A    3
3    A    3
1    A    4
1    B    2
3    B    2
4    B    4
4    B    4
3    B    2
2    C    2
2    C    2
4    C    4
2    C    4
3    C    3
1    D    5
2    D    5"
)

df %>%
  group_by(Var2) %>%
  mutate(SID_per_Var2 = n_distinct(SID)) %>%
  group_by(Var1, Var2) %>%
  mutate(SID_per_Var1Var2 = n_distinct(SID)) %>%
  ungroup() %>% 
  add_count(Var1) %>%
  add_count(Var1, Var2) %>%
  mutate(
    Count.Excluding.Var1 = if_else(
      n > nn,
      SID_per_Var2 - SID_per_Var1Var2 + 1,
      0
    )
  ) %>%
  select(SID, Var1, Var2, Count.Excluding.Var1)
#> # A tibble: 17 x 4
#>      SID Var1   Var2 Count.Excluding.Var1
#>    <int> <chr> <int>                <dbl>
#>  1     4 A         2                   3.
#>  2     3 A         2                   3.
#>  3     4 A         3                   1.
#>  4     3 A         3                   1.
#>  5     1 A         4                   3.
#>  6     1 B         2                   3.
#>  7     3 B         2                   3.
#>  8     4 B         4                   3.
#>  9     4 B         4                   3.
#> 10     3 B         2                   3.
#> 11     2 C         2                   4.
#> 12     2 C         2                   4.
#> 13     4 C         4                   2.
#> 14     2 C         4                   2.
#> 15     3 C         3                   2.
#> 16     1 D         5                   0.
#> 17     2 D         5                   0.

Created on 2018-04-12 by the reprex package (v0.2.0).

0
votes

Here's a solution using purrr - you can wrap this in a mutate statement if you want, but I don't know that it adds much in this particular case.

library(purrr)
df$Count.Excluding.Var1 = map_int(1:nrow(df), function(n) {
  df %>% filter(Var2 == Var2[n], Var1 != Var1[n]) %>% distinct() %>% nrow()
})

(Updated with input from comments by Calum You. Thanks!)

0
votes

A 100% tidyverse solution:

library(tidyverse) # dplyr + purrr
df  %>%
  group_by(Var2) %>%
  mutate(count = map_int(Var1,~n_distinct(SID[.x!=Var1],Var1[.x!=Var1])))

# # A tibble: 15 x 4
# # Groups:   Var2 [3]
#      SID  Var1  Var2 count
#    <int> <chr> <int> <int>
#  1     4     A     2     3
#  2     3     A     2     3
#  3     4     A     3     1
#  4     3     A     3     1
#  5     1     A     4     3
#  6     1     B     2     3
#  7     3     B     2     3
#  8     4     B     4     3
#  9     4     B     4     3
# 10     3     B     2     3
# 11     2     C     2     4
# 12     2     C     2     4
# 13     4     C     4     2
# 14     2     C     4     2
# 15     3     C     3     2