Hi I am looking for a tidyverse way of finding IDs (observations) that are present in every wave.
To elaborate on the data at hand:
We have mutliple waves (>20) that contain IDs and addtional measurements.
For exemplary purposes here is a simulation of our data at hand:
pacman::p_load(tidyverse)
wave1 <- tibble(
id = seq_along(1:100),
x = runif(100, 0, 100)
)
# In wave2 some observations drop out & some new observations are added
wave2 <- tibble(
id = seq_along(1:150),
x = runif(150, 0, 100)
)
# Simulation of Dropout
wave2 %>%
filter(!id %in% sample(1:150, 23)) -> wave2
# Same with Wave 3
wave3 <- tibble(
id = c(wave2 %>% pull(id),151:200),
x = runif(nrow(wave2) + 50, 0, 100)
)
# Simulation of Dropout
wave3 %>%
filter(!id %in% sample(1:200, 33)) -> wave3
I am looking for a handy way to find which IDs from, e.g., Wave1 are present in all other waves (in the example only wave2 and 3). And how many IDs from Wave2 are present in all further waves (in the example only wave3).
Solution (with a lot of writing effort)
An easy solution with a lot of writing effort for that many waves would be with inner_join
### Solution
wave1 %>%
inner_join(wave2, by = "id") %>%
nrow() # = 83
wave1 %>%
inner_join(wave3, by = "id") %>%
nrow() # = 68
But this seems like unnecessary repetition. Especially if you have more than 20 waves.
The waves allways follow the same naming: "wave" followed by up to 2 digits (e.g. "wave16").
Could it be possible to work with a character vector (e.g., waves <- c("wave1","wave2","wave3")
) of the names and a combination of a for loop and an eval
statement? OR with a list of data.frames and lapply?
Further ideas to solve it
I thought that a structure like a nested tibble could help with that.
So that I could just iterate over the rows.
Here is a simulation of the nested tibble
wave1 %>% write.csv(file = "wave1.csv")
wave2 %>% write.csv(file = "wave2.csv")
wave3 %>% write.csv(file = "wave3.csv")
files <- dir( pattern = "^wave\\d+.csv")
data <- tibble(filename = files) %>%
mutate(file_content = map(filename, ~ read_delim(file.path(.x)))) %>%
mutate(df = str_extract(filename,"wave\\d+")) %>%
relocate(df, .before = filename)
data
How the nested tibble would look like
However I am stuck with pull
and unnesting of the data.frames, as pull would give me list and unlist
does mess up the structure. Also I am not so sure about the for-loop to iterate over rows in a tidyverse approach
for (i in 1:nrow(data)) {
data[i,] %>%
pull(file_content)
}