0
votes

I have millions of CSV files with different headers that I want to merge into one big data frame.

My problem is that the solutions I have tried do work, but far too slowly! By the way, I have access to sparklyr on a multi-node cluster in my lab; would this big-data tool help?

The files look like this:

file1

header1,header3,header5

a,b,c

file2

header4,header2

e,f

file3

header2,header6

a,c

I want to merge them into:

header1,header2,header3,header4,header5,header6

a,,b,,c,
,f,,e,,
,a,,,,c

I have tried to bind them directly in R, but the program crashed after running for a few days on the server. The code looks like this:

library(plyr)
library(dplyr)
library(readr)
library(data.table)   # for rbindlist()


csvfiles <- list.files(pattern = "file\\d+.csv") 

for (i in 1:length(csvfiles)) {
  assign(paste0("files", i),read_csv(csvfiles[[i]]))
}

csvlist <- mget(ls(pattern = "files\\d"))

result <- data.frame()

for (i in 1:length(csvlist)){
  my_list <- list(result,csvlist[[i]])
  result <- rbindlist(my_list,use.names=TRUE, fill=TRUE)
 }

Then I tried to first extract a combined header using command-line tools such as sed, awk and csvtk. The code I used looks like this:

for file in $(ls file*.csv); do cat $file | sed "2 d" | csvtk transpose   >> name_combined.csv; done

awk  '{ lines[$1] = $0 } END { for (l in lines) print lines[l] }' name_combined.csv >> long_head.csv

This gives me a CSV file named long_head.csv, which looks like this (in reality it has over 3,000 columns):

header1,header2,header3,header4,header5,header6

Then I use bind_rows() from dplyr. The idea is to first output multiple CSV files that all share the same columns and then combine them all.

library(readr)
library(dplyr)

csvfiles <- list.files(pattern = "file\\d+.csv")
long_head <- read_csv("long_head.csv")

new_file <- paste("new_file",1:length(csvfiles),sep = "")

for (i in 1:length(csvfiles)) {
  bind_rows(long_head, read_csv(csvfiles[[i]])) %>% 
    write_csv(file = paste0(new_file[[i]], ".csv"))
}

The code can only output about 100k CSV files per day, which means I would have to wait a whole month before I could even start merging them.

I have also tried to combine them directly without writing multiple csv files:

library(readr)
library(dplyr)

csvfiles <- list.files(pattern = "file\\d+.csv")

long_head <- read_csv("long_head.csv")


for (i in 1:length(csvfiles)) {
  a <- bind_rows(read_csv(csvfiles[[i]]),long_head)
  result <- rbind(a,long_head)
}

It runs faster but still falls short of my expectations.

3
Try data.table::rbindlist() – Rohit
Hi, what did you try yourself? You mention that you have some code that runs too slow. Can you show it to us? – kvantour
First of all, transpose all files, so each row becomes a column. Then start appending the files one by one. After that, transpose your huge file so each column becomes a row again. And voilà, you have the merged file you want. – 404pio
I have made my question more specific... My problem is that I cannot read all my CSV files into R at once, thus the speed is quite slow. – Zhiliang Lin

3 Answers

2
votes

This is a challenging question which requires considering speed as well as memory consumption.

If I understand correctly, the OP wants to merge millions of small CSV files. According to the sample data, each file consists of just 2 rows: the header in the first row and character data in the second row. The number of columns as well as the column names may vary from file to file. However, all columns share the same data type, character.

Both the OP's first attempt and M. Viking's answer grow the result object iteratively. This is very inefficient as it requires copying the same data over and over again. In addition, both use read_csv() from the readr package, which is not the fastest CSV reader either.

To avoid growing the result object iteratively, all files are read into a list which is then combined using rbindlist() in one go. The final result is then stored as a CSV file:

library(data.table)
file_names <- list.files(pattern = "file\\d+.csv")
result <- rbindlist(lapply(file_names, fread), use.names=TRUE, fill=TRUE)
fwrite(result, "result.csv")

From OP's expected result it seems that the columns should be ordered by column name. This can be achieved by

library(magrittr)
setcolorder(result, names(result) %>% sort())

which rearranges the columns of a data.table object by reference, i.e., without copying the whole object.

Performance

Now, let's look at processing time. For benchmarking, I have created 100k files (see the Data section below), which is far from the volume the OP is aiming at but still allows some conclusions to be drawn.

On my PC, the overall processing time was about 5 minutes:

bench::workout({
  fn <- list.files(pattern = "file\\d+.csv")
  tmp_list <- lapply(fn, function(x) fread(file = x, sep =",", header = TRUE, colClasses = "character") )
  result <- rbindlist(tmp_list, use.names=TRUE, fill=TRUE)
  setcolorder(result, names(result) %>% sort())
  fwrite(result, "result.csv")
}, 1:5)
# A tibble: 5 x 3                                                                                                                    
  exprs       process     real
  <bch:expr> <bch:tm> <bch:tm>
1 1           562.5ms 577.19ms
2 2             1.81m    4.52m
3 3            14.05s   15.55s
4 4           15.62ms  175.1ms
5 5              2.2s    7.72s

Here, I have used the workout() function from the bench package to time the individual expressions in order to identify the statement which takes the most time. The bulk of the time is spent reading the CSV files.

The object sizes are also of importance. The combined data.table result with 100k rows and 1000 columns occupies 800 MB, the temporary list only about 11% of that. This is due to the many empty cells.

pryr::object_size(result)
800 MB
pryr::object_size(tmp_list)
87.3 MB

By the way, the result file "result.csv" has a size of 98 MB on disk.

Conclusion

Computing time does not seem to be the main issue; the memory required to store the result is.

If reading 100k files takes about 5 minutes, I guess that reading 1M files may take about 50 minutes.

But for 1M files with 3000 columns, the resulting data.table may require 10 * 3 = 30 times more memory, i.e., about 24 GB. The temporary list may take only about 900 MB. So it might be worthwhile to reconsider the resulting data structure (see the sketch below).
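
One way to reconsider it, not part of the benchmark above and not benchmarked here, is to keep the data in long key-value form, i.e. one row per (file, header, value) triple, which avoids storing millions of empty cells. The column names and the idcol label below are illustrative:

library(data.table)
fn <- list.files(pattern = "file\\d+.csv")
long_list <- lapply(fn, function(x) {
  dt <- fread(file = x, sep = ",", header = TRUE, colClasses = "character")
  # turn the single data row into one (header, value) pair per column
  melt(dt, measure.vars = names(dt),
       variable.name = "header", value.name = "value")
})
# idcol = "file" records the index of the source file for every row
long_result <- rbindlist(long_list, idcol = "file")
# the wide layout can still be recovered later for a subset, e.g.
# dcast(long_result, file ~ header, value.var = "value")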

Benchmarking different file readers

The timings above show that over 90% of the computing time is spent reading the data files. Therefore, it is worthwhile to benchmark different methods for reading CSV files:

  • read.csv() from base R
  • read_csv() from the readr package
  • fread() from the data.table package

As a convenience to the user, all three functions offer to guess certain characteristics of the file such as field delimiters or data types. This may require additional computing time. Therefore, the functions are also benchmarked with explicitly stated file parameters.

For benchmarking, the bench package is used because it also measures the allocated memory, which might be the other limiting factor besides computing time. The benchmark is repeated for different numbers of files in order to study the effect on memory consumption.

library(bench)
library(data.table)
library(readr)
file_names <- list.files(pattern = "file\\d+.csv")
bm <- press(
  n_files = c(1000, 2000, 5000, 10000),
  {
    fn <- file_names[seq_len(n_files)] 
    mark(
      fread = lapply(fn, fread),
      fread_p = lapply(fn, function(x) fread(file = x, sep =",", header = TRUE, colClasses = "character")),
      # fread_pp = lapply(fn, fread, sep =",", header = TRUE, colClasses = "character"),
      read.csv = lapply(fn, read.csv),
      read.csv_p = lapply(fn, read.csv, colClasses = "character"),
      read_csv = lapply(fn, read_csv),
      read_csv_p = lapply(fn, read_csv, col_types = cols(.default = col_character())),
      check = FALSE,
      min_time = 10
    )
  }
)

The results are visualised by

library(ggplot2)
ggplot(bm) + aes(n_files, median, color = names(expression)) + 
  geom_point() + geom_line() + scale_x_log10()
ggsave("median.png")
ggplot(bm) + aes(n_files, mem_alloc, color = names(expression)) + 
  geom_point() + geom_line() + scale_x_log10()
ggsave("mem_alloc.png")
ggplot(bm) + aes(median, mem_alloc, color = names(expression)) + geom_point() + 
  facet_wrap(vars(n_files))
ggsave("mem_allov_vs_median.png")

When comparing median execution times, we can observe (please note the double logarithmic scale) that

  • computing times increase almost linearly with the number of files;
  • explicitly passing file parameters (timings suffixed ..._p) always gives a performance gain over guessing the parameters, in particular for read_csv();
  • for reading many small files, read.csv() is faster than fread(), while read_csv() is slower by a large factor.

[Plot median.png: median execution time vs. number of files, per reader]

When comparing allocated memory, we can observe (again, note the double logarithmic scale) that

  • memory consumption increases almost linearly with the number of files;
  • explicitly passing file parameters (timings suffixed ..._p) does not have a significant effect on memory allocation, except for read_csv() where guessing parameters seems to be rather costly;
  • fread() allocates remarkably less memory than the other two readers.

[Plot mem_alloc.png: allocated memory vs. number of files, per reader]

As mentioned above, both speed and memory consumption might be crucial here. read.csv() seems to be the best choice in terms of speed, while fread() is best in terms of memory consumption, as can be seen from the facetted scatter plot below.

My personal choice was to prefer fread() (less memory consumption) over read.csv() (faster) because RAM on my PC is limited and cannot be extended easily. Your mileage may vary.

[Plot: allocated memory vs. median execution time, facetted by number of files]
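
If, on a particular machine, speed matters more than memory, the reader in the pipeline above can simply be swapped. Below is a sketch using the explicitly parameterised read.csv() call from the benchmark; it has not been benchmarked end to end at the full data volume:

library(data.table)   # still used for rbindlist() and fwrite()
fn <- list.files(pattern = "file\\d+.csv")
# read.csv() with explicit colClasses was the fastest reader for many small files
tmp_list <- lapply(fn, read.csv, colClasses = "character")
result <- rbindlist(tmp_list, use.names = TRUE, fill = TRUE)
fwrite(result, "result.csv")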

Data

The code below was used to create the 100k sample files:

library(magrittr)   # piping used to improve readability
n_files <- 10^5L   # 100k files
max_header <- 10^3L
avg_cols <- 4L
headers <- sprintf("header%02i", seq_len(max_header))
set.seed(1L)   # to ensure reproducible results
for (i in seq_len(n_files)) {
  n_cols <- rpois(1L, avg_cols - 1L) + 1L # exclude 0
  header <- sample(headers, n_cols)
  file_name <- sprintf("file%i.csv", i)
  sample(letters, n_cols, TRUE) %>% 
    as.list() %>% 
    as.data.frame() %>%
    set_names(header) %>% 
    data.table::fwrite(file_name)
}
2
votes

Below you find a method using GNU awk which first reads only the header of every file and then reads each file once completely. It will do the following:

  • Read the header of each file, and close the file
  • If a new header element is discovered, add it to the end of the currently known elements. E.g., if the following headers exist:

    file1: A,B,D
    file2: A,C,E
    file3: A,E,D
    

    the output header becomes

    A,B,D,C,E
    
  • After all headers have been analysed, it reads all files completely and rewrites every record against the full header, inserting empty fields where needed.

This script makes use of the CSV-parsing approach from What's the most robust way to efficiently parse CSV using awk?

Create a file merge_csv.awk with the following content:

BEGIN {
   OFS=","
   FPAT="[^,]*|\042[^\042]+\042"
   # keep track of the original argument count
   argc_start=ARGC
}

# Read header and process
# header names are stored as array index in the array "header"
# header order is stored in the array header_order
#    header_order[field_index] = header_name
(FNR == 1) && (ARGIND < argc_start) {
    for(i=1;i<=NF;++i) if (!($i in header)) { header[$i]; header_order[++nf_out]=$i } 
    # add file to end of argument list to be reprocessed
    ARGV[ARGC++] = FILENAME
    # process the next file
    nextfile
}

# Print headers in output file
(FNR == 1) && (ARGIND == argc_start) {
    for(i=1;i<=nf_out;++i) printf header_order[i] (i==nf_out ? ORS : OFS)
}

# Use array h to keep track of the column_name and corresponding field_index
# h[column_name] = field_index
(FNR == 1) { delete h; for(i=1;i<=NF;++i) h[$i]=i; next }

# print record
{
    # process all fields
    for(i=1;i<=nf_out;++i) {
        # get field index using h
        j = h[header_order[i]]+0
        # if field index is zero, print empty field
        printf (j == 0 ? "" : $j) (i==nf_out ? ORS : OFS)
    }
}

Now you can run the script as

$ awk -f merge_csv.awk *.csv > output.csv

This will not work with a huge number of CSV files, because the shell's argument list becomes too long. This can be solved in the following way: assume you have a file filelist.txt containing all the files you want (it could be generated by find), then adapt the above script as follows:

BEGIN {
   OFS=","
   FPAT="[^,]*|\042[^\042]+\042"
}

# Read original filelist, and build argument list
(FNR == NR) { ARGV[ARGC++]=$0; argc_start=ARGC; next }

# Read header and process
# header names are stored as array index in the array "header"
# header order is stored in the array header_order
#    header_order[field_index] = header_name
(FNR == 1) && (ARGIND < argc_start) {
    for(i=1;i<=NF;++i) if (!($i in header)) { header[$i]; header_order[++nf_out]=$i } 
    # add file to end of argument list to be reprocessed
    ARGV[ARGC++] = FILENAME
    # process the next file
    nextfile
}

# Print headers in output file
(FNR == 1) && (ARGIND == argc_start) {
    for(i=1;i<=nf_out;++i) printf header_order[i] (i==nf_out ? ORS : OFS)
}

# Use array h to keep track of the column_name and corresponding field_index
# h[column_name] = field_index
(FNR == 1) { delete h; for(i=1;i<=NF;++i) h[$i]=i; next }

# print record
{
    # process all fields
    for(i=1;i<=nf_out;++i) {
        # get field index using h
        j = h[header_order[i]]+0
        # if field index is zero, print empty field
        printf (j == 0 ? "" : $j) (i==nf_out ? ORS : OFS)
    }
}

Now you can run the code as:

$ awk -f merge_csv.awk filelist.txt

If your file list is really too big, you might want to use split and a loop to create several temporary CSV files, which can then be merged again in a second or even a third run.
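
The chunked file lists could also be produced from R instead of find and split. Below is a sketch; the chunk size of 100k and the filelist_###.txt names are made up for illustration:

# write one file list per chunk of 100k CSV files; each list can then be
# fed to merge_csv.awk separately and the partial results merged afterwards
files  <- list.files(pattern = "file\\d+\\.csv")
chunks <- split(files, ceiling(seq_along(files) / 1e5))
for (i in seq_along(chunks)) {
  writeLines(chunks[[i]], sprintf("filelist_%03d.txt", i))
}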

0
votes
  • Using dir with a pattern to select filenames;
  • Adding a source-file column, which will be useful down the road;
  • A simpler for loop call;
  • Forcing all columns to character, the safest option when reading multiple files; readr's type guessing will abort if it runs into field type mismatches.

Note: a 16-file test run consistently crashes my computer somewhere between a 15 MB, 771-column census.csv and a 180 MB, 1.6M-row beer_reviews.csv.

library(readr)
library(dplyr)

setwd("/home/username/R/csv_test")

csvfiles <- dir(pattern = "\\.csv$")

# start with a one-row placeholder tibble; the filename column records the source file of every appended row
csvdata  <- tibble(filename = c("Source File"))

for (i in csvfiles) {
  tmpfile <- read_csv(i, col_types = cols(.default = "c"))  # all columns as character
  tmpfile$filename <- i
  csvdata <- bind_rows(csvdata, tmpfile)
}
csvdata
# A tibble: 1,622,379 x 874

...

Timed a 10-file test run with a total of 20k rows and 100 columns. In R:

 user  system elapsed 
0.678   0.008   0.685 

And the Awk script on this page:

real    0m2.202s
user    0m2.175s
sys     0m0.025s