1
votes

Hi I am looking for a tidyverse way of finding IDs (observations) that are present in every wave.

To elaborate on the data at hand:
We have mutliple waves (>20) that contain IDs and addtional measurements.
For exemplary purposes here is a simulation of our data at hand:

pacman::p_load(tidyverse)
wave1 <- tibble(
  id = seq_along(1:100),
  x = runif(100, 0, 100)
)
# In wave2 some observations drop out & some new observations are added
wave2 <- tibble(
  id = seq_along(1:150),
  x = runif(150, 0, 100)
)
# Simulation of Dropout
wave2 %>%
  filter(!id %in% sample(1:150, 23)) -> wave2

# Same with Wave 3
wave3 <- tibble(
  id = c(wave2 %>% pull(id),151:200),
  x = runif(nrow(wave2) + 50, 0, 100)
)
# Simulation of Dropout
wave3 %>%
  filter(!id %in% sample(1:200, 33)) -> wave3

I am looking for a handy way to find which IDs from, e.g., Wave1 are present in all other waves (in the example only wave2 and 3). And how many IDs from Wave2 are present in all further waves (in the example only wave3).

Solution (with a lot of writing effort)

An easy solution with a lot of writing effort for that many waves would be with inner_join

### Solution
wave1 %>% 
  inner_join(wave2, by = "id") %>% 
  nrow() # = 83

wave1 %>% 
  inner_join(wave3, by = "id") %>% 
  nrow() # = 68

But this seems like unnecessary repetition. Especially if you have more than 20 waves. The waves allways follow the same naming: "wave" followed by up to 2 digits (e.g. "wave16"). Could it be possible to work with a character vector (e.g., waves <- c("wave1","wave2","wave3")) of the names and a combination of a for loop and an eval statement? OR with a list of data.frames and lapply?

Further ideas to solve it

I thought that a structure like a nested tibble could help with that.
So that I could just iterate over the rows.

Here is a simulation of the nested tibble

wave1 %>% write.csv(file = "wave1.csv")
wave2 %>% write.csv(file = "wave2.csv")
wave3 %>% write.csv(file = "wave3.csv")

files <- dir( pattern = "^wave\\d+.csv")
data <- tibble(filename = files) %>%
  mutate(file_content = map(filename, ~ read_delim(file.path(.x)))) %>%
  mutate(df =  str_extract(filename,"wave\\d+")) %>%
  relocate(df, .before = filename)
data

How the nested tibble would look like
enter image description here

However I am stuck with pull and unnesting of the data.frames, as pull would give me list and unlist does mess up the structure. Also I am not so sure about the for-loop to iterate over rows in a tidyverse approach

for (i in 1:nrow(data)) {
  data[i,] %>% 
    pull(file_content)
}
1

1 Answers

1
votes

We could do this in a loop. Get the values of the objects of interest in a list with mget (paste is used as we need only 'wave2', 'wave3' and not 'wave1' - or else can also do this automatically with ls(pattern = '^wave\\d+$'), but it may be less flexible - could also do ls(pattern = '^wave[2-3]$'), but this may gets messy with wave10 and so on ..)

library(dplyr)
library(purrr)
mget(paste0('wave', 2:3)) %>% 
   map( ~ wave1 %>% 
             inner_join(.x, by = "id") %>% 
             nrow)
$wave2
[1] 85

$wave3
[1] 69

For the second case with 'data' (not clear based on the code what kind of operations are being done). After looping over the sequence of rows of data, extract the 'file_content' list element with [[

for(i in seq_len(nrow(data))) {
    print(data$file_content[[i]])
}

From a tidyverse approach, we can access the data with either rowwise

data %>%
    rowwise %>%
    mutate(out = yourfun(file_content))

Or with map

library(purrr)
data %>%
    mutate(out = map(file_content, yourfun))