How to count the number of participants with multiple observations/rows that have a combination of values in different rows of a column?

Question

I have a dataset where each participant has multiple observations (column name=id). In each observation, the participant is given a diagnosis (column name=diagnosis).

I would like to count the number of participants who have a specific combination of diagnoses.

Please find a reproducible example of R code below. I have tried to group the data by id, filter by the combination of two diagnoses, and then count the number of participants, but this returns no data.

Do you see any solutions?

Thank you!

library(tidyverse)
id <- c(1,1,1,2,2,2,3,3,3)
diagnosis <- c("a101", "b101", "a101",
               "c101", "c101", "c101",
               "b101", "a101", "b101")
data <- data.frame(id, diagnosis, stringsAsFactors = FALSE)

n_a101_and_b101 <- data %>%
 group_by(id) %>%
 filter((substr(diagnosis,1,4)=="a101") & 
        (substr(diagnosis,1,4)=="b101")) %>%
 tally()

n_a101_and_b101

The & needs to be |. (substr(diagnosis, 1,4)=="a101") | (substr(diagnosis,1,4)=="b101"). diagnosis can not be a101 and b101 at the same time. — ricoderks
You don't need to use substr as you are trying to spot exact matches. — AntoniosK
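
To illustrate the comments above, here is a minimal base R sketch of why the original filter returns nothing. It uses only the three rows of participant 1 from the question; the names both_on_same_row, either_on_row and has_both are made up for this illustration.

```r
# Rows of participant 1 from the question's data
diagnosis <- c("a101", "b101", "a101")

# The condition is evaluated row by row, so a single row can never
# equal "a101" and "b101" at the same time.
both_on_same_row <- diagnosis == "a101" & diagnosis == "b101"
any(both_on_same_row)   # FALSE: filter() would keep no rows

# With |, rows carrying either code are kept, but this alone does not
# check that the participant has both codes.
either_on_row <- diagnosis == "a101" | diagnosis == "b101"
all(either_on_row)      # TRUE for this participant

# A group-level check asks whether each code occurs anywhere
# in the participant's rows.
has_both <- any(diagnosis == "a101") && any(diagnosis == "b101")
has_both                # TRUE
```

So switching to | stops the filter from returning an empty result, but counting participants with both diagnoses still needs a per-participant check like the last one.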

AntoniosK · Accepted Answer · 2019-12-05T11:55:36

library(tidyverse)

id <- c(1,1,1,2,2,2,3,3,3)
diagnosis <- c("a101", "b101", "a101",
               "a101", "c101", "c101",
               "b101", "a101", "b101")

data <- data.frame(id, diagnosis, stringsAsFactors = FALSE)

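data %>%
  group_by(id) %>%
  # keep participants with at least one "a101" row and at least one "b101" row
  filter(sum(diagnosis == "a101") * sum(diagnosis == "b101") > 0) %>%
  # one row per remaining participant, then count them
  distinct(id) %>%
  nrow()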

# [1] 2

You can also use filter("a101" %in% diagnosis & "b101" %in% diagnosis) as an alternative.
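
As a sketch of how that alternative might look as a full pipeline, using the same data as in this answer: the variable name n_both and the use of n_distinct() in place of distinct() plus nrow() are my own choices here, not part of the original code.

```r
library(dplyr)

# Same data as in the answer above
id <- c(1,1,1,2,2,2,3,3,3)
diagnosis <- c("a101", "b101", "a101",
               "a101", "c101", "c101",
               "b101", "a101", "b101")

data <- data.frame(id, diagnosis, stringsAsFactors = FALSE)

n_both <- data %>%
  group_by(id) %>%
  # within each participant, check that both codes occur somewhere
  filter("a101" %in% diagnosis & "b101" %in% diagnosis) %>%
  ungroup() %>%
  # count the distinct participants that remain
  summarise(n = n_distinct(id)) %>%
  pull(n)

n_both
# [1] 2
```

For a longer list of required codes, the condition could probably be written as all(c("a101", "b101") %in% diagnosis) instead of chaining several %in% checks.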
