4
votes

I have a dataframe in a long format and I want to filter pairs based on unique combinations of values. I have a dataset that looks like this:

id <- rep(1:4, each=2)
type <- c("blue", "blue", "red", "yellow", "blue", "red", "red", "yellow")
df <- data.frame(id,type)
df
  id   type
1  1   blue
2  1   blue
3  2    red
4  2 yellow
5  3   blue
6  3    red
7  4    red
8  4 yellow

Let's say each id is a respondent and type is a combination of treatments. Individual 1 saw two objects, both of them blue; individual 2 saw one red object and a yellow one; and so on.

How do I keep, for example, those that saw the combination "red" and "yellow"? If I filter by the combination "red" and "yellow" the resulting dataset should look like this:

  id   type
3  2    red
4  2 yellow
7  4    red
8  4 yellow

It should keep respondents number 2 and number 4 (only those that saw the combination "red" and "yellow"). Note that it does not keep respondent number 3 because she saw "blue" and "red" (instead of "red" and "yellow"). How do I do this?

One solution is to reshape the dataset into a wide format, filter it by column, and restack again. But I am sure there is another way to do it without reshaping the dataset. Any idea?

4

4 Answers

8
votes

A dplyr solution would be:

library(dplyr)
df <- data_frame(
  id = rep(1:4, each = 2),
  type = c("blue", "blue", "red", "yellow", "blue", "red", "red", "yellow")
)

types <- c("red", "yellow")

df %>% 
  group_by(id) %>% 
  filter(all(types %in% type))
#> # A tibble: 4 x 2
#> # Groups:   id [2]
#>      id   type
#>   <int>  <chr>
#> 1     2    red
#> 2     2 yellow
#> 3     4    red
#> 4     4 yellow

Update

Allowing for the equal combinations, e.g. blue, blue, we have to change the filter-call to the following:

types2 <- c("blue", "blue")

df %>% 
  group_by(id) %>% 
  filter(sum(types2 == type) == length(types2))
#> # A tibble: 2 x 2
#> # Groups:   id [1]
#>      id  type
#>   <int> <chr>
#> 1     1  blue
#> 2     1  blue

This solution also allows different types

df %>% 
  group_by(id) %>% 
  filter(sum(types == type) == length(types))
#> # A tibble: 4 x 2
#> # Groups:   id [2]
#>      id   type
#>   <int>  <chr>
#> 1     2    red
#> 2     2 yellow
#> 3     4    red
#> 4     4 yellow
4
votes

Let's use all() to see if all rows within group match a set of values.

library(tidyverse)

test_filter <- c("red", "yellow")

df %>%
  group_by(id) %>% 
  filter(all(test_filter %in% type))

# A tibble: 4 x 2
# Groups: id [2]
id type  
<int> <fctr>
1     2 red   
2     2 yellow
3     4 red   
4     4 yellow
1
votes

I modified your data and did the following.

df <- data.frame(id = rep(1:4, each=3),
                 type <- c("blue", "blue", "green", "red", "yellow", "purple",
                           "blue", "orange", "yellow", "yellow", "pink", "red"),
                 stringsAsFactors = FALSE)

   id   type
1   1   blue
2   1   blue
3   1  green
4   2    red
5   2 yellow
6   2 purple
7   3   blue
8   3 orange
9   3 yellow
10  4 yellow
11  4   pink
12  4    red

As you see, there are three observations for each id. id 2 and 4 have both red and yellow. They also have non-target colors (i.e., purple, and pink). I wanted to preserve these observations. In order to achieve this task, I wrote the following code. The code can be read like this. "For each id, check if there is any red and yellow using any(). When both conditions are TRUE, keep all rows for the id."

group_by(df, id) %>%
filter(any(type == "yellow") & any(type == "red"))

   id   type
4   2    red
5   2 yellow
6   2 purple
10  4 yellow
11  4   pink
12  4    red
0
votes

Using data.table:

library(data.table)
setDT(df)
df[, type1 := shift(type, type = "lag"), by = id]
df1 <- df[type == "yellow" & type1 == "red", id]
df <- df[id %in% df1, ]
df[, type1 := NULL]

It gives:

   id   type
1:  2    red
2:  2 yellow
3:  4    red
4:  4 yellow