2
votes

I have a directory containing ~300K text files that I would like to concatenate into a single file, separating the contents of each file with a newline (\n). For example:

file1 = 'i like apples'
file2 = 'john likes oranges'
output = 'i like apples\njohn likes oranges'

The problem is that due to the large number of files, commands like

awk '{print}' dir/* > combined.txt

throw an error about the list of arguments being too long. Any quick way to get around this issue? I have been trying to find a way to use piping but have been unsuccessful so far.

The text files do not end in a \n.

Does the order in which files are written to the combined file matter? - Benjamin W.
Nope. So if you have a solution involving GNU parallel or something equivalent, that would be even better! - Orest Xherija
I think parallel would be difficult when writing to a single file. - Benjamin W.
@shellter this throws the "Argument list too long" error. - Orest Xherija
@shellter ...and then you're at "why not use -exec instead". - Benjamin W.

3 Answers

2
votes

To avoid the long command line, you can use a shell construct such as a for loop:

for f in dir/*; do cat "$f"; printf '\n'; done > combined.txt

If the order of files in the combined file doesn't matter, you can use find instead:

find dir -type f -exec sed -s '$s/$/\n/' {} + > combined.txt

This uses find -exec with the + terminator, which passes as many file names as fit to each invocation. That minimizes the number of times the command in -exec is called, while avoiding command lines that are too long.

sed -s '$s/$/\n/' replaces the end of the last line in a file with a newline; -s makes sed treat each file separately, so the change is applied to every file when multiple are supplied as arguments.
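As a quick sanity check of the loop approach, here is a minimal sketch that recreates the two sample files from the question in a scratch directory and runs the same loop. The mktemp directory and the file names are assumptions for illustration only. The glob is expanded inside the shell and cat is started once per file, so the argument-size limit (reported by getconf ARG_MAX) should not come into play:

```shell
#!/bin/sh
# Scratch directory holding the two sample files from the question.
# printf without a trailing \n mimics files that do not end in a newline.
tmp=$(mktemp -d)
mkdir "$tmp/dir"
printf 'i like apples' > "$tmp/dir/file1"
printf 'john likes oranges' > "$tmp/dir/file2"

# The shell expands the glob itself; each cat call receives a single path,
# so no single command line ever carries the whole file list.
for f in "$tmp"/dir/*; do
    cat "$f"
    printf '\n'
done > "$tmp/combined.txt"

cat "$tmp/combined.txt"
```

With 300K files this starts 300K cat processes, so it trades speed for simplicity; the find variant above starts far fewer.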

0
votes

One good way of working around a large list of files is to use find, which is standard on most distros. Something like:

find ./dir -type f -exec bash -c 'cat "$1" && echo' bash {} \; > combined.txt

I did not test it, but this should work, and it has the advantage of never building an argument list containing all the files in dir.
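Running one shell per file can be slow for ~300K files. A possible refinement, sketched here under the assumption that output order does not matter, is to let find batch the paths with + and loop over them inside a single sh call. The scratch directory and file names below are made up for the demonstration:

```shell
#!/bin/sh
# Scratch directory with two sample files (names are illustrative only).
tmp=$(mktemp -d)
mkdir "$tmp/dir"
printf 'i like apples' > "$tmp/dir/file1"
printf 'john likes oranges' > "$tmp/dir/file2"

# With "+", find hands each sh invocation as many paths as fit on one
# command line. "for f do" iterates over the positional parameters, and
# the extra "sh" argument fills $0 so the first path is not swallowed.
find "$tmp/dir" -type f -exec sh -c '
    for f do
        cat "$f"
        printf "\n"
    done
' sh {} + > "$tmp/combined.txt"

cat "$tmp/combined.txt"
```

The redirection sits outside find, so combined.txt is opened once rather than once per file.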

0
votes

Solution with GNU Parallel:

printf '%s\0' * | parallel -0 'cat {}; echo' > combined.txt

Minor deviation: combined.txt will end in \n, which the question does not ask for.

My guess is, however, that you will be I/O constrained, so Benjamin W.'s solution may be faster.
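If the trailing \n matters, one way to avoid it is to write the separator before every file except the first instead of after each file. This is only a sketch using a plain loop rather than parallel; the scratch directory, the file names, and the first flag are assumptions introduced for illustration:

```shell
#!/bin/sh
# Scratch directory with two sample files (names are illustrative only).
tmp=$(mktemp -d)
mkdir "$tmp/dir"
printf 'i like apples' > "$tmp/dir/file1"
printf 'john likes oranges' > "$tmp/dir/file2"

# Emit the newline before each file except the first, so the output
# contains separators between files but nothing after the last one.
first=1
for f in "$tmp"/dir/*; do
    if [ -n "$first" ]; then
        first=
    else
        printf '\n'
    fi
    cat "$f"
done > "$tmp/combined.txt"

# Count newline characters: one separator for two files.
wc -l < "$tmp/combined.txt"
```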