This is a challenging question which requires considering speed as well as memory consumption.
If I understand correctly, the OP wants to merge millions of small csv files. According to the sample data, each file consists of just 2 rows: the header in the first row and character data in the second row. The number of columns as well as the column names may vary from file to file. However, all columns are of the same data type, character.
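For illustration, two such files might look like this (hypothetical file names and values, following the structure described above and the naming scheme of the Data section below):
# hypothetical file1.csv
header01,header04
q,x
# hypothetical file2.csv
header02,header03,header07
a,f,p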
Both OP's first attempt and M. Viking's answer grow the result object iteratively. This is very inefficient because it requires copying the same data over and over again. In addition, both use read_csv() from the readr package, which is not the fastest csv reader either.
To avoid growing the result object iteratively, all files are read into a list which is then combined in one go using rbindlist(). The final result is then stored as a csv file:
library(data.table)
# collect all file names matching the pattern file1.csv, file2.csv, ...
file_names <- list.files(pattern = "file\\d+.csv")
# read all files into a list and combine by column name, filling missing columns with NA
result <- rbindlist(lapply(file_names, fread), use.names = TRUE, fill = TRUE)
fwrite(result, "result.csv")
From OP's expected result it seems that the columns should be ordered by column name. This can be achieved by
library(magrittr)
setcolorder(result, names(result) %>% sort())
which rearranges the columns of a data.table object by reference, i.e., without copying the whole object.
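As a minimal illustration on a toy data.table (not OP's data):
library(data.table)
# columns are reordered alphabetically in place; no copy of dt is made
dt <- data.table(b = 1:2, a = 3:4, c = 5:6)
setcolorder(dt, sort(names(dt)))
names(dt)
# [1] "a" "b" "c"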
Performance
Now, let's look at processing time. For benchmarking, I have created 100k files (see the Data section below), which is far from the volume the OP is aiming at but should allow us to draw conclusions.
On my PC, the overall processing time was about 5 minutes:
bench::workout({
fn <- list.files(pattern = "file\\d+.csv")
tmp_list <- lapply(fn, function(x) fread(file = x, sep =",", header = TRUE, colClasses = "character") )
result <- rbindlist(tmp_list, use.names=TRUE, fill=TRUE)
setcolorder(result, names(result) %>% sort())
fwrite(result, "result.csv")
}, 1:5)
# A tibble: 5 x 3
exprs process real
<bch:expr> <bch:tm> <bch:tm>
1 1 562.5ms 577.19ms
2 2 1.81m 4.52m
3 3 14.05s 15.55s
4 4 15.62ms 175.1ms
5 5 2.2s 7.72s
Here, I have used the workout() function from the bench package to time the individual expressions in order to identify the statement which takes the most time. The bulk of the time is spent reading the csv files.
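If the workout result is stored in a variable, say wo (an assumed assignment, e.g. wo <- bench::workout({...}, 1:5), not shown above), the share of each step can be computed directly:
# wo is the assumed variable holding the bench::workout() result from above
as.numeric(wo$real) / sum(as.numeric(wo$real))  # share of total real time per step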
The object sizes are also important. The combined data.table result with 100k rows and 1000 columns occupies 800 MB, whereas the temporary list takes only 11% of that. This is due to the many empty cells.
pryr::object_size(result)
800 MB
pryr::object_size(tmp_list)
87.3 MB
By the way, the result file "result.csv" has a size of 98 MBytes on disk.
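To confirm that the size difference really comes from the many empty cells, the share of NA cells in the wide result can be checked (assuming result from the code above is still in memory; rbindlist(fill = TRUE) fills missing columns with NA):
mean(is.na(result))  # fraction of cells in the wide result that are NA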
Conclusion
Computing time does not seem to be the main issue; rather, it is the memory required to store the result.
If reading 100k files takes about 5 minutes, I guess that reading 1M files may take about 50 minutes.
But for 1M files with 3000 columns, the resulting data.table may require 10 * 3 = 30 times more memory, i.e., about 24 GBytes. The temporary list may take only about 900 MBytes. So, it might be worthwhile to re-consider the resulting data structure, e.g., along the lines sketched below.
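One possible alternative (a rough sketch, not benchmarked at OP's scale) is to keep the data in long format, which stores one row per non-empty cell and thus avoids allocating NA for every column a file does not contain:
library(data.table)
# read each file, melt it to long format and remember which file it came from;
# colClasses = "character" keeps the value column type-consistent across files
long <- rbindlist(lapply(file_names, function(x) {
  dt <- fread(file = x, sep = ",", header = TRUE, colClasses = "character")
  melt(dt, measure.vars = names(dt), variable.name = "header", value.name = "value")[, file := x]
}))
# if the wide layout is needed later, it can be re-created on demand:
# dcast(long, file ~ header, value.var = "value")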
Benchmarking different file readers
The timings above show that over 90% of the computing time is spent reading the data files. Therefore, it is appropriate to benchmark different methods for reading CSV files:
- read.csv() from base R
- read_csv() from the readr package
- fread() from the data.table package
As a convenience to the user, all three functions offer to guess certain characteristics of the file, such as field delimiters or data types. This may require additional computing time. Therefore, the functions are also benchmarked with explicitly stated file parameters.
For benchmarking, the bench package is used because it also measures the allocated memory, which might be the other limiting factor besides computing time. The benchmark is repeated for different numbers of files in order to study the effect on memory consumption.
library(bench)      # provides press() and mark() used below
library(data.table)
library(readr)
file_names <- list.files(pattern = "file\\d+.csv")
bm <- press(
n_files = c(1000, 2000, 5000, 10000),
{
fn <- file_names[seq_len(n_files)]
mark(
fread = lapply(fn, fread),
fread_p = lapply(fn, function(x) fread(file = x, sep =",", header = TRUE, colClasses = "character")),
# fread_pp = lapply(fn, fread, sep =",", header = TRUE, colClasses = "character"),
read.csv = lapply(fn, read.csv),
read.csv_p = lapply(fn, read.csv, colClasses = "character"),
read_csv = lapply(fn, read_csv),
read_csv_p = lapply(fn, read_csv, col_types = cols(.default = col_character())),
check = FALSE,
min_time = 10
)
}
)
The results are visualised by
library(ggplot2)
ggplot(bm) + aes(n_files, median, color = names(expression)) +
geom_point() + geom_line() + scale_x_log10()
ggsave("median.png")
ggplot(bm) + aes(n_files, mem_alloc, color = names(expression)) +
geom_point() + geom_line() + scale_x_log10()
ggsave("mem_alloc.png")
ggplot(bm) + aes(median, mem_alloc, color = names(expression)) + geom_point() +
facet_wrap(vars(n_files))
ggsave("mem_alloc_vs_median.png")
When comparing median execution times, we can observe (please note the double logarithmic scale) that
- computing times increase almost linearly with the number of files;
- explicitly passing file parameters (timings named ..._p) always gives a performance gain over guessing the parameters, in particular for read_csv();
- for reading many small files, read.csv() is faster than fread(), while read_csv() is slower by factors.
When comparing allocated memory, we can observe (again, note the double logarithmic scale) that
- memory consumption increases almost linearly with the number of files;
- explicitly passing file parameters (timings named ..._p) does not have a significant effect on memory allocation, except for read_csv(), where guessing the parameters seems to be rather costly;
- fread() allocates remarkably less memory than the other two readers.
As mentioned above, both speed and memory consumption might be crucial here. read.csv() seems to be the best choice in terms of speed, while fread() is best in terms of memory consumption, as can be seen from the faceted scatter plot below.
My personal choice was to prefer fread() (less memory consumption) over read.csv() (faster) because RAM on my PC is limited and cannot be extended easily. Your mileage may vary.
Data
The code below was used to create the 100k sample files:
library(magrittr) # piping used to improve readability
n_files <- 10^5L # 100k files, matching the benchmarks above
max_header <- 10^3L
avg_cols <- 4L
headers <- sprintf("header%02i", seq_len(max_header))
set.seed(1L) # to ensure reproducible results
for (i in seq_len(n_files)) {
n_cols <- rpois(1L, avg_cols - 1L) + 1L # exclude 0
header <- sample(headers, n_cols)
file_name <- sprintf("file%i.csv", i)
sample(letters, n_cols, TRUE) %>%
as.list() %>%
as.data.frame() %>%
set_names(header) %>%
data.table::fwrite(file_name)
}
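As an optional sanity check (a small sketch, assuming the loop above has finished), the generated files can be inspected like this:
length(list.files(pattern = "file\\d+.csv"))  # number of generated files
readLines("file1.csv")                        # one header row plus one data row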