Why is my awk script much slower than the head+tail script?

Question

I want to split a huge file (big.txt). by given line numbers. For example, if the give numbers are 10 15 30, I will get 4 files: 1-10, 11-15, 16-30, and 30 - EOF of the big.txt.

Solving the problem is not a challenge for me, I wrote 3 different solutions. However, I cannot explain the performance. Why the awk script is the slowest. (GNU Awk)

For the big.txt, I just did seq 1.5billion > big.txt (~15Gb)

first, the head and tail:

INPUT_FILE="big.txt"  # The input file
LINE_NUMBERS=( 400000 700000 1200000 ) # Given line numbers
START=0                 # The offset to calculate lines
IDX=1                   # The index used in the name of generated files: file1, file2 ...

for i in "${LINE_NUMBERS[@]}"
do
    # Extract the lines
    head -n $i "$INPUT_FILE" | tail -n +$START > "file$IDX.txt"
    #
    (( IDX++ ))
    START=$(( i+1 ))
done

# Extract the last given line - last line in the file
tail -n +$START "$INPUT_FILE" > "file$IDX.txt"

The 2nd: sed:

INPUT_FILE="big.txt"  # The input file
LINE_NUMBERS=( 400000 700000 1200000 ) # Given line numbers
START=1                 # The offset to calculate lines
IDX=1                   # The index used in the name of generated files: file1, file2 ...

for i in "${LINE_NUMBERS[@]}"
do
    T=$(( i+1 ))
    # Extract the lines using sed command
    sed -n -e " $START, $i p" -e "$T q" "$INPUT_FILE" > "file$IDX.txt"
    (( IDX++ ))
    START=$T
done

# Extract the last given line - last line in the file
sed -n "$START, $ p" "$INPUT_FILE" > "file$IDX.txt"

the last, awk

awk -v nums="400000 700000 1200000" 'BEGIN{c=split(nums,a)} {
    for(i=1; i<=c; i++){
        if( NR<=a[i] ){
            print > "file" i ".txt"
            next
        }
    }
    print > "file" c+1 ".txt"
}' big.txt

From my testing (using time command), the head+tail is the fastest:

real 73.48
user 1.42
sys 17.62

the sed one:

real 144.75
user 105.68
sys 15.58

the awk one:

real 234.21
user 187.92
sys 3.98

The awk went through the file only once, why it is much slower than the other two? Also, I thought the tail and head would be the slowest solution, how come it's so fast? I guess it might be something to do with the awk's redirection? (print > file)

Can someone explain it to me? Thank you.

Heh. When I saw this question, I assumed that it would be asked by someone who didn't know that tail could seek(); it's rather considerably more interesting than that. :) — Charles Duffy
BTW, which specific awk implementation are you using? (gawk, at least, caches and reuses file descriptors for redirections; but for others, it might be worth checking). — Charles Duffy
There's a bit of similar discussion on this here: unix.stackexchange.com/questions/47407/… — Sundeep
@CharlesDuffy it's gawk. I've updated the question to include the information. — Kent

thanasisp thanasisp · Accepted Answer · 2020-09-07T00:55:20

Can awk be faster than head and tail for this?

No, it will be slower, at least for a reasonable number of chunks for a large input file. Because it will read every line and do some work with it. On the other hand, head and tail will read massively the newline characters, without doing anything, will seek until they find the line number provided by the argument. Then they don't have again to read line by line and decide what to do, but dump the content, similar to cat.

If we increase the number of chunks, if the array of splitting line numbers is getting larger and larger, then we will reach a point where the cost of calling many head and tail processes will overcome the cost of one awk process, and from that point after, awk would be faster.

awk script improvement

This awk is slow because of that loop! Just think that for the last output file, for every line to print, we run 4 iterations until we print the line. Of course the time complexity still remains linear to the input, but all these checks and assignments have costs that can be observed as input grows. It can be much improved, e.g. like this:

> cat tst.awk
BEGIN { 
    a[1]
    a[40000]
    a[70000]
    a[120000]
}

NR in a {
    close(out)
    out = "file" ++i ".txt"
}

{ print > out }

Here we test only NR per line, actually we almost only print.

awk -f tst.awk big.txt

Testing

Here is some basic testing, I did a file, not huge, 5.2M lines.

> wc -l big.txt 
5288558 big.txt

Now, with that loop, it really matters where you split the file! If you have to write most of the rows into the last chunks, that means more iterations, it is slower

> head -1 test.sh
awk -v nums="100000 200000 300000" 'BEGIN{c=split(nums,a)} {
> time sh test.sh

real    0m10.960s
user    0m10.823s
sys     0m0.066s

If most rows goes to first file (that means one iteration and next) it becomes faster!

> head -1 test.sh
awk -v nums="5000000 5100000 5200000" 'BEGIN{c=split(nums,a)} {
> time sh test.sh

real    0m6.914s
user    0m6.838s
sys     0m0.043s

With the above modification it should be fast enough regardless the cut points.

> time awk -f tst.awk big.txt 

real    0m4.270s
user    0m4.185s
sys     0m0.048s

Why is my awk script much slower than the head+tail script?

first, the head and tail:

The 2nd: sed:

the last, awk

2 Answers