
I am using a for loop to merge multiple files with another file:

library(data.table)

files <- list.files("path", pattern = "\\.TXT$", ignore.case = TRUE)

for (i in seq_along(files))
{
  data <- fread(files[i], header = TRUE)

  # Merge the current file into the running result by ID
  mydata <- merge(mydata, data, by = "ID", all.x = TRUE)

  rm(data)
}

"mydata" looks as follows (simplified):

ID  x1  x2
1   2   8
2   5   5
3   4   4
4   6   5
5   5   8

"data" looks as follows (around 600 files, in total 100GB). Example of 2 (seperate) files. Integrating all in 1 would be impossible (too large):

ID  x3
1   8
2   4

ID  x3
3   4
4   5
5   1

When I run my code I get the following dataset:

ID  x1  x2  x3.x    x3.y
1   2   8   8       NA
2   5   5   4       NA
3   4   4   NA      4
4   6   5   NA      5
5   5   8   NA      1

What I would like to get is:

ID  x1  x2  x3
1   2   8   8
2   5   5   4
3   4   4   4
4   6   5   5
5   5   8   1

IDs are unique (never duplicated across the 600 files).

Any idea on how to achieve this as efficiently as possible would be much appreciated.

Are you trying to merge all the text files into a single text file? If the IDs are unique, why not use rbind or cbind to join them instead of merging? - user5249203
I don't think there is a function that does 'merging' the way you want for your data structure. We might have to write one. Your data files don't necessarily always contain only the column 'x3', right? - Vlo
data is always exactly the same: 1 column "ID", 1 column "x3". I only have more IDs in the separate data files than in mydata - research111

1 Answer


This is better suited as a comment, but I can't comment yet.

Would it not be better to rbind instead of merge? That seems to be what you want to accomplish.

Set the fill argument to TRUE to take care of differing columns:

library(data.table)

asd <- data.table(x1 = c(1, 2), x2 = c(4, 5))
a <- data.table(x2 = 5)
rbind(asd, a, fill = TRUE)

   x1 x2
1:  1  4
2:  2  5
3: NA  5

Do this with your data files, then merge the combined result into mydata by ID.

Update for comment

library(data.table)
library(plyr)

files <- list.files("path", pattern = "\\.TXT$", ignore.case = TRUE)

# Read one file and return it as a data.table
ff <- function(input){
  fread(input)
}

a <- lapply(files, ff)
binded.data <- ldply(a, function(x) rbind(x, fill = TRUE))

This creates a function that reads a file and passes it to lapply, so you get a list containing all your data files, each as its own table.

ldply from plyr then rbinds all of those tables into a single data.frame.
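
As a side note (my addition, not part of the original answer): data.table ships its own binder, which skips the plyr dependency and returns a data.table directly:

binded.data <- rbindlist(a, fill = TRUE)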

Don't touch mydata yet.

binded.data <- data.table(binded.data, key = "ID")
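
If you went the rbindlist route above, binded.data is already a data.table, so you can set the key in place instead (again an aside, not from the original answer):

setkey(binded.data, ID)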

Depending on your mydata, you will need a different merge command. See: https://rstudio-pubs-static.s3.amazonaws.com/52230_5ae0d25125b544caab32f75f0360e775.html
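
For instance, a minimal sketch of the join this question describes, assuming you want to keep every row of mydata (all.x = TRUE, as in the original loop):

# One single merge instead of 600: every row of 'mydata' is kept and picks
# up its x3 value; since IDs never repeat across files, no x3.x/x3.y clashes
mydata <- merge(mydata, binded.data, by = "ID", all.x = TRUE)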

Update 2

files <- list.files("path", pattern = "\\.TXT$", ignore.case = TRUE)

ff <- function(input){
  data <- fread(input)
  # Keep only the rows of 'data' whose ID also occurs in 'mydata'
  data[ID %in% mydata[, ID]]
}

a <- lapply(files, ff)
library(plyr)
binded.data <- ldply(a, function(x) rbind(x, fill = TRUE))

Update 3

You can add cat to print the name of the file the function is currently reading. That way you can see after which file you run out of memory, which points you toward how many files you can read in one go.

ff <- function(input){
  # Print the name of the file being read right now
  cat(input, "\n")
  data <- fread(input)
  # Keep only the rows of 'data' whose ID also occurs in 'mydata'
  data[ID %in% mydata[, ID]]
}
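
If even the filtered reads run out of memory, here is a sketch of reading the files in batches (my addition, not the answer's exact method; the batch size of 50 is an assumption to tune using the cat output):

library(data.table)

# Split the file names into batches of 50 (assumed size, adjust as needed)
batches <- split(files, ceiling(seq_along(files) / 50))

# Read and filter one batch at a time; each batch is collapsed into a single
# small data.table before the next batch is read
chunks <- lapply(batches, function(b) {
  a <- lapply(b, ff)         # 'ff' as defined above, with the ID filter
  rbindlist(a, fill = TRUE)
})
binded.data <- rbindlist(chunks, fill = TRUE)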