
I have a large nested data set with the following structure:

> str(Normalized_All)
List of 48
 $ Traces/Sample10_1_D.csv:'data.frame':    2988 obs. of  2 variables:
  ..$ Time_min    : num [1:2988] 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 ...
  ..$ Sample10_1_D: num [1:2988] 0 0 0 0 0 0 0 0 0 0 ...
 $ Traces/Sample10_1_L.csv:'data.frame':    2965 obs. of  2 variables:
  ..$ Time_min    : num [1:2965] 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 ...
  ..$ Sample10_1_L: num [1:2965] 0 0 0 0 0 0 0 0 0 0 ...
 $ Traces/Sample10_1_R.csv:'data.frame':    2962 obs. of  2 variables:
  ..$ Time_min    : num [1:2962] 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 ...
  ..$ Sample10_1_R: num [1:2962] 0 0 0 0 0 0 0 0 0 0 ...

I want to full_join all of the data into one tibble, essentially by recursively applying dplyr::full_join(x, y, by="Time_min"). I want to use full_join because the Time_min columns are not all the same length, although there are many overlapping time points. Each 'Sample' column has a unique name, and I want to minimize the number of rows that are mostly NA.

Is there an elegant way to do this? Preferably using dplyr or related tidyverse packages.

Please make an example tibble so we have something to work with (perhaps by running the code that generates this on tables you've run "head" on in Bash, and then running dput() on that df). I suspect there's a better (faster) way to do this if we tidy the table, but I'd like to try to test it. – GenesRus

1 Answer


You could perhaps try simply:

library(dplyr)

all_data <- Normalized_All %>% Reduce(f = function(x, y) full_join(x, y, by = "Time_min"))

This sequentially full_joins each data frame onto the accumulated result, matching rows on Time_min.
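If you prefer to stay within the tidyverse, purrr::reduce does the same fold and forwards extra arguments to the joining function. A minimal sketch, assuming your list is named Normalized_All as in the str() output above:

library(dplyr)
library(purrr)

# Fold the list of data frames with full_join; the by argument is
# passed through to each full_join() call.
all_data <- reduce(Normalized_All, full_join, by = "Time_min")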

If you get memory issues from the above, you could try using data.table instead:

library(data.table)

all_data2 <- Reduce( x=Normalized_All, f=function(a, b) {
    setDT(b)                  # convert this piece to a data.table by reference
    if (is.null(a)) {
        return(b)             # first iteration: nothing accumulated yet
    }
    merge(a, b, by = "Time_min", all = TRUE)   # full (outer) join on Time_min
}, init = NULL)
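To sanity-check either approach before running it on the full 48-element list, you could use a small stand-in list with the same shape (hypothetical sample values, mimicking the structure shown in your str() output):

library(dplyr)

# Two small data frames with partially overlapping Time_min values
toy <- list(
    data.frame(Time_min = c(0, 0.01, 0.02), Sample10_1_D = c(0, 1, 2)),
    data.frame(Time_min = c(0, 0.02, 0.03), Sample10_1_L = c(5, 6, 7))
)

Reduce(f = function(x, y) full_join(x, y, by = "Time_min"), x = toy)
# Rows with matching Time_min are combined; time points present in only
# one data frame get NA in the columns coming from the other.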