I am trying to split large pcap files containing hundreds of TCP streams into separate files. My current approach (see below) seems quite inefficient to me. My question is: What is the most efficient way of splitting pcap files into separate files by TCP stream?
Current approach
In my current approach, I first use tshark to find out which TCP streams are in the file. Next, for each of these TCP streams, I read the original file and extract the given stream. The code snippet below shows my approach:
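#!/bin/bash
# Path to the input pcap (taken from the first argument here).
file="$1"
# Get all TCP stream numbers (one full read of the file).
for stream in $(tshark -r "$file" -T fields -e tcp.stream | sort -nu)
do
    # Extract the specified stream from $file and write it to a separate file
    # (one more full read of the file per stream).
    tshark -r "$file" -Y "tcp.stream eq $stream" -w "$file.$stream.pcap"
done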
However, this approach seems inefficient, as tshark has to read the pcap file many times: once to list the streams and then once more for each stream. Ideally, I would like a solution that goes over the original pcap file once and, upon finding a packet belonging to a specific connection, appends it to the output file for that connection.
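To make the single-pass idea concrete, here is a minimal Python sketch using only the standard library. It rests on several assumptions: the input is a classic pcap file (not pcapng) with Ethernet link type, only IPv4 is handled, and packets are grouped by a direction-independent address/port pair rather than by tshark's tcp.stream numbering. The names `split_pcap` and `tcp_key` are hypothetical.

```python
import struct

# Magic numbers of classic pcap files (microsecond and nanosecond variants).
LITTLE_ENDIAN_MAGICS = (b"\xd4\xc3\xb2\xa1", b"\x4d\x3c\xb2\xa1")
BIG_ENDIAN_MAGICS = (b"\xa1\xb2\xc3\xd4", b"\xa1\xb2\x3c\x4d")


def tcp_key(frame):
    """Return a direction-independent key for an Ethernet/IPv4/TCP frame,
    or None for anything else (assumption: no VLAN tags, no IPv6)."""
    if len(frame) < 34 or frame[12:14] != b"\x08\x00":   # EtherType is not IPv4
        return None
    ihl = (frame[14] & 0x0F) * 4                          # IPv4 header length in bytes
    if frame[23] != 6 or len(frame) < 14 + ihl + 4:       # not TCP, or truncated
        return None
    src, dst = frame[26:30], frame[30:34]
    sport, dport = struct.unpack_from("!HH", frame, 14 + ihl)
    # Sorting the two endpoints makes both directions map to the same key.
    return tuple(sorted([(src, sport), (dst, dport)]))


def split_pcap(path, out_prefix):
    """Read `path` once and append every TCP packet to a per-connection
    file named `<out_prefix>.<n>.pcap`. Returns the number of files written."""
    outputs = {}  # connection key -> open output file
    try:
        with open(path, "rb") as src:
            global_hdr = src.read(24)
            if global_hdr[:4] in LITTLE_ENDIAN_MAGICS:
                endian = "<"
            elif global_hdr[:4] in BIG_ENDIAN_MAGICS:
                endian = ">"
            else:
                raise ValueError("not a classic pcap file")
            if struct.unpack(endian + "I", global_hdr[20:24])[0] != 1:
                raise ValueError("only Ethernet link type is handled")
            while True:
                rec_hdr = src.read(16)
                if len(rec_hdr) < 16:
                    break                                  # end of file
                incl_len = struct.unpack(endian + "IIII", rec_hdr)[2]
                frame = src.read(incl_len)
                if len(frame) < incl_len:
                    break                                  # truncated last record
                key = tcp_key(frame)
                if key is None:
                    continue                               # skip non-TCP packets
                out = outputs.get(key)
                if out is None:
                    out = open(f"{out_prefix}.{len(outputs)}.pcap", "wb")
                    out.write(global_hdr)                  # reuse the input's header
                    outputs[key] = out
                out.write(rec_hdr + frame)                 # copy the record unchanged
    finally:
        for out in outputs.values():
            out.close()
    return len(outputs)
```

A call such as `split_pcap("capture.pcap", "capture.pcap")` would produce `capture.pcap.0.pcap`, `capture.pcap.1.pcap`, and so on. Note that this sketch keeps one file handle open per connection, so a capture with more connections than the process's open-file limit would need a different strategy, for example reopening files in append mode.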
Other approaches
I have looked around for other approaches as well, but they do not seem to suit my situation:
- PcapPlusPlus' PcapSplitter has a slightly different definition of a TCP connection. It defines a 'connection' as all packets sharing the same (protocol, source ip, destination ip, source port, destination port) 5-tuple, which might merge packets incorrectly if multiple TCP streams reuse the same tuple. I believe wireshark/tshark actually base their TCP streams on the SYN:SYN-ACK and FIN:FIN-ACK flags (but please correct me if I am wrong).
- Python's Scapy has the same problem as PcapSplitter in that it does not provide any way of splitting TCP streams other than by the 5-tuple described above. (Of course, I could write this myself, but that would be beyond the scope of my current work.)
Also, for both of these solutions, I am not entirely sure whether they can correctly handle erroneous captures (for example, truncated or malformed packets).
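To illustrate the tuple-reuse concern, the following Python sketch assigns stream numbers so that a second connection on an already-seen tuple gets its own number. It is a simplification under a stated assumption: a new stream starts whenever an initial SYN (SYN set, ACK clear) arrives with a sequence number different from the one that opened the current stream on that tuple. This is not claimed to reproduce tshark's exact tcp.stream logic, and `StreamIndexer` is a hypothetical name.

```python
class StreamIndexer:
    """Assign stream numbers to packets, given a direction-independent
    tuple key plus the TCP flags byte and sequence number of each packet."""

    SYN, ACK = 0x02, 0x10

    def __init__(self):
        # key -> (stream number, sequence number of the SYN that opened it,
        #         or None if the stream was first seen without a SYN)
        self._current = {}
        self._next = 0

    def index(self, key, flags, seq):
        entry = self._current.get(key)
        is_initial_syn = bool(flags & self.SYN) and not (flags & self.ACK)
        # Open a new stream for an unseen tuple, or for a fresh initial SYN
        # on a known tuple. A retransmitted SYN (same seq) stays in place.
        if entry is None or (is_initial_syn and entry[1] != seq):
            entry = (self._next, seq if is_initial_syn else None)
            self._next += 1
            self._current[key] = entry
        return entry[0]


indexer = StreamIndexer()
key = (("10.0.0.1", 1000), ("10.0.0.2", 80))
first = indexer.index(key, 0x02, 100)    # initial SYN opens a stream
same = indexer.index(key, 0x12, 500)     # SYN-ACK belongs to the same stream
reused = indexer.index(key, 0x02, 900)   # new SYN on the same tuple: new stream
```

Here `first` and `same` receive the same stream number, while `reused` receives a new one even though the 5-tuple is identical. A plain 5-tuple splitter would put all three packets in one file.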
Question
Therefore, I would like to have some suggestions on how to split pcap files into separate files based on TCP stream in the most efficient way.