I want to split a huge file (big.txt). by given line numbers. For example, if the give numbers are 10 15 30, I will get 4 files: 1-10, 11-15, 16-30, and 30 - EOF of the big.txt.
Solving the problem is not a challenge for me, I wrote 3 different solutions. However, I cannot explain the performance. Why the awk script is the slowest. (GNU Awk)
For the big.txt, I just did seq 1.5billion > big.txt (~15Gb)
first, the head and tail:
INPUT_FILE="big.txt" # The input file
LINE_NUMBERS=( 400000 700000 1200000 ) # Given line numbers
START=0 # The offset to calculate lines
IDX=1 # The index used in the name of generated files: file1, file2 ...
for i in "${LINE_NUMBERS[@]}"
do
# Extract the lines
head -n $i "$INPUT_FILE" | tail -n +$START > "file$IDX.txt"
#
(( IDX++ ))
START=$(( i+1 ))
done
# Extract the last given line - last line in the file
tail -n +$START "$INPUT_FILE" > "file$IDX.txt"
The 2nd: sed:
INPUT_FILE="big.txt" # The input file
LINE_NUMBERS=( 400000 700000 1200000 ) # Given line numbers
START=1 # The offset to calculate lines
IDX=1 # The index used in the name of generated files: file1, file2 ...
for i in "${LINE_NUMBERS[@]}"
do
T=$(( i+1 ))
# Extract the lines using sed command
sed -n -e " $START, $i p" -e "$T q" "$INPUT_FILE" > "file$IDX.txt"
(( IDX++ ))
START=$T
done
# Extract the last given line - last line in the file
sed -n "$START, $ p" "$INPUT_FILE" > "file$IDX.txt"
the last, awk
awk -v nums="400000 700000 1200000" 'BEGIN{c=split(nums,a)} {
for(i=1; i<=c; i++){
if( NR<=a[i] ){
print > "file" i ".txt"
next
}
}
print > "file" c+1 ".txt"
}' big.txt
From my testing (using time command), the head+tail is the fastest:
real 73.48
user 1.42
sys 17.62
the sed one:
real 144.75
user 105.68
sys 15.58
the awk one:
real 234.21
user 187.92
sys 3.98
The awk went through the file only once, why it is much slower than the other two? Also, I thought the tail and head would be the slowest solution, how come it's so fast? I guess it might be something to do with the awk's redirection? (print > file)
Can someone explain it to me? Thank you.
tailcouldseek(); it's rather considerably more interesting than that. :) - Charles Duffysplitcommand, right? - tripleee