1
votes

I have a dataset where each participant has multiple observations (column name=id). In each observation each participant is given a diagnosis (column name=diagnosis).

I would like to count the number of participants who have a specific combination of diagnoses.

Please find a reproducible example of R code below. I have tried to group the data by id, filter by the combination of two diagnoses, and then count the number of participants, but this returns no data.

Do you see any solutions?

Thank you!

library(tidyverse)
id <- c(1,1,1,2,2,2,3,3,3)
diagnosis <- c("a101", "b101", "a101",
               "c101", "c101", "c101",
               "b101", "a101", "b101")
data <- data.frame(id, diagnosis, stringsAsFactors = FALSE)

n_a101_and_b101 <- data %>%
 group_by(id) %>%
 filter((substr(diagnosis,1,4)=="a101") & 
        (substr(diagnosis,1,4)=="b101")) %>%
 tally()

n_a101_and_b101
3
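The filter condition is checked row by row, and a single diagnosis cannot be a101 and b101 at the same time, so the & condition never matches. Replacing & with |, i.e. (substr(diagnosis,1,4)=="a101") | (substr(diagnosis,1,4)=="b101"), keeps the rows that have either diagnosis. - ricoderks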
You don't need to use substr as you are trying to spot exact matches. - AntoniosK
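The two comments above can be checked directly on the question's data. This is only a minimal sketch: the names rows_and and rows_or are made up for illustration, and dplyr is loaded on its own instead of the whole tidyverse.

```r
library(dplyr)

# Data from the question
data <- data.frame(
  id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  diagnosis = c("a101", "b101", "a101",
                "c101", "c101", "c101",
                "b101", "a101", "b101"),
  stringsAsFactors = FALSE
)

# Row-wise AND: no single row can equal both codes, so nothing is kept
rows_and <- data %>%
  filter(diagnosis == "a101" & diagnosis == "b101")
nrow(rows_and)
# [1] 0

# Row-wise OR: keeps every row with either code, then counts rows per id
rows_or <- data %>%
  group_by(id) %>%
  filter(diagnosis == "a101" | diagnosis == "b101") %>%
  tally()
rows_or$n
# [1] 3 3
```

Note that the | version counts matching rows per id, not participants, and an id with only one of the two codes would also be kept, so on its own it does not answer the question.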

3 Answers

3
votes
library(tidyverse)

id <- c(1,1,1,2,2,2,3,3,3)
diagnosis <- c("a101", "b101", "a101",
               "a101", "c101", "c101",
               "b101", "a101", "b101")

data <- data.frame(id, diagnosis, stringsAsFactors = FALSE)

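data %>%
  group_by(id) %>%
  # keep an id only if it has at least one a101 row and at least one b101 row
  # (the product of the two counts is > 0 only when both counts are non-zero)
  filter(sum(diagnosis == "a101") * sum(diagnosis == "b101") > 0) %>%
  distinct(id) %>%   # one row per remaining id
  nrow()             # number of participants with both diagnoses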

# [1] 2

You can also use filter("a101" %in% diagnosis & "b101" %in% diagnosis) as an alternative.
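A sketch of how the %in% alternative might look as a full pipeline, using the data as redefined in this answer. The vector name wanted and the result name n_both are hypothetical, and all() is used here so the same line works for more than two codes.

```r
library(dplyr)

# Data as redefined in this answer (id 2 has a101 but no b101)
data <- data.frame(
  id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  diagnosis = c("a101", "b101", "a101",
                "a101", "c101", "c101",
                "b101", "a101", "b101"),
  stringsAsFactors = FALSE
)

# Codes that a participant must have (hypothetical helper vector)
wanted <- c("a101", "b101")

n_both <- data %>%
  group_by(id) %>%
  # TRUE for a whole group only if every wanted code occurs in it
  filter(all(wanted %in% diagnosis)) %>%
  ungroup() %>%
  summarise(n = n_distinct(id)) %>%
  pull(n)

n_both
# [1] 2
```

Because the test is evaluated once per group, id 2 is dropped even though it has an a101 row.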

1
votes

I suspect there is a nicer tidyverse-style pipe answer, but I would do this via set operations:

Get the set of participants with diagnosis_1 and the set of participants with diagnosis_2, then find the intersection, i.e. the ids that are present in both sets.

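a <- filter(data, diagnosis == "a101")   # rows with the first diagnosis
b <- filter(data, diagnosis == "b101")   # rows with the second diagnosis
intersect(a$id, b$id)                    # ids that appear in both
# [1] 1 3
length(intersect(a$id, b$id))            # number of such participants
# [1] 2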
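The same set-based idea can be extended to any number of diagnoses in base R. This is a sketch under assumptions: wanted, ids_per_code and ids_with_all are illustrative names, and the data is the one from the question.

```r
# Data from the question
data <- data.frame(
  id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  diagnosis = c("a101", "b101", "a101",
                "c101", "c101", "c101",
                "b101", "a101", "b101"),
  stringsAsFactors = FALSE
)

# Codes that a participant must have (hypothetical helper vector)
wanted <- c("a101", "b101")

# One vector of ids per code
ids_per_code <- lapply(wanted, function(code) {
  unique(data$id[data$diagnosis == code])
})

# Intersect all id vectors pairwise, left to right
ids_with_all <- Reduce(intersect, ids_per_code)

ids_with_all
# [1] 1 3
length(ids_with_all)
# [1] 2
```

Adding a third code to wanted requires no other change, since Reduce folds intersect over however many id vectors there are.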
1
votes

There might be several ways to achieve this. I just modified your code:

ab101 <- data %>% 
  group_by(id) %>% 
  mutate(yes = (max(diagnosis == "a101") + 
                  max(diagnosis == "b101"))==2) %>% 
  summarise(yes = yes[1])
sum(ab101$yes)

[1] 2
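If the intermediate mutate step is not needed, the per-participant flag could also be computed directly in summarise. A minimal sketch, assuming the question's data; per_id and has_both are names chosen here for illustration, and any() replaces the max() trick.

```r
library(dplyr)

# Data from the question
data <- data.frame(
  id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  diagnosis = c("a101", "b101", "a101",
                "c101", "c101", "c101",
                "b101", "a101", "b101"),
  stringsAsFactors = FALSE
)

# One row per id with a logical flag: does this id have both codes?
per_id <- data %>%
  group_by(id) %>%
  summarise(has_both = any(diagnosis == "a101") && any(diagnosis == "b101"))

# TRUE counts as 1, so the sum is the number of participants with both
sum(per_id$has_both)
# [1] 2
```

Keeping the per-id table around can be handy when the flagged participants need to be inspected or joined back to the data.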