3
votes

I am trying to capture an HTTP download with Python using dpkt and pcap. The code looks like this:

...
pc = pcap.pcap(iface)
for ts, pkt in pc:
    handle_packet(pkt)

def handle_packet(pkt):
    eth = dpkt.ethernet.Ethernet(pkt)

    # Ignore non-IP and non-TCP packets
    if eth.type != dpkt.ethernet.ETH_TYPE_IP:
        return
    ip = eth.data
    if ip.p != dpkt.ip.IP_PROTO_TCP:
        return

    tcp = ip.data
    data = tcp.data

    # current connection
    c = (ip.src, ip.dst, tcp.sport, tcp.dport)

    # Handle only new HTTP-responses and TCP-packets
    # of existing connections.
    if c in conn:
        handle_tcp_packet(c, tcp)
    elif data[:4] == b'HTTP':
        handle_http_response(c, tcp)
...

In handle_http_response() and handle_tcp_packet() I read the data of the TCP packets (tcp.data) and write it to a file. However, I noticed that I often get packets with the same TCP sequence number (tcp.seq) on the same connection, and they seem to contain the same data. Moreover, it seems that not all packets are captured. For example, if I sum up the packet sizes, the resulting value is lower than the one listed in the HTTP header (Content-Length). But in Wireshark I can see all packets.

Does anyone have an idea why I get those duplicate packets and how I can capture every packet belonging to the HTTP response?

EDIT:
Here you can find the complete code: pastebin.com. When run, it prints something like this to stdout:

Waiting for HTTP-Audio-responses ...
...
New TCP-Packet, len=1440, tcp-payload=5107680, con-len=5197150 , dups=57 , dup-bytes=82080
New TCP-Packet, len=1440, tcp-payload=5109120, con-len=5197150 , dups=57 , dup-bytes=82080
New TCP-Packet, len=1440, tcp-payload=5110560, con-len=5197150 , dups=57 , dup-bytes=82080
----------> FIN <----------
New TCP-Packet, len=1937, tcp-payload=5112497, con-len=5197150 , dups=57 , dup-bytes=82080
New TCP-Packet, len=0, tcp-payload=5112497, con-len=5197150 , dups=57 , dup-bytes=82080

As you can see, the TCP payload plus the duplicate received bytes (5112497+82080=5194577) is lower than the file size of the download (5197150). Moreover, you can see that I receive 57 duplicate packets (same SEQ and same TCP data) and that packets are still received after the packet with the FIN flag.

So does anyone have an idea how I can capture all packets belonging to the connection? Wireshark sees all packets, and I think it uses libpcap too.

I don't even know whether I am doing something wrong or whether the pcap library is doing something wrong.

EDIT2:
OK, it seems that my code is correct: in Wireshark I saved the captured packets and used the capture file in my code (pcap.pcap('/home/path/filename') instead of pcap.pcap('eth0')). My code read all packets perfectly (in multiple tests)! Since Wireshark uses libpcap too (afaik), I think the problem is the pypcap library, which does not give me all packets.

Any idea on how to test that?

I already compiled pypcap myself (from trunk), but that didn't change anything -.-

EDIT3:
OK, I changed my code to work with pcapy instead of pypcap and have the same problem:
When reading the packets from a previously captured file (created with Wireshark), everything is fine, but when I capture the packets directly from eth0, I miss some packets.

Interesting: when running both programs (the one using pypcap and the one using pcapy) in parallel, they capture different packets, e.g. one program receives one packet more than the other.

But I still have no idea why -.-
I thought Wireshark uses the same base library (libpcap).

Please help :)

2
Are you missing entire packets, or cutting packets short? pcap has (used to have?) a small buffer by default, so you don't (didn't?) always get all the data for each packet. - andrew cooke
That's an interesting question :) In Wireshark each TCP packet has 1440 bytes of data. The packets from pcap have 1440 bytes of data, too. The Content-Length of the download is 5197150. The sum of the TCP packet lengths is 5152510 (excluding duplicate packets with the same SEQ as previous packets, and excluding the HTTP header information). The difference (5197150-5152510=44640) is (always) a multiple of 1440. So I think I am missing entire packets, right? - Biggie

2 Answers

1
votes

Here are a couple of things to watch out for:

  • make sure you have a big snaplen - for pcapy you can set it in open_live (second parameter)
  • make sure you handle fragmented packets - this will not be done automatically, so you need to check the details yourself
  • check the capture statistics - unfortunately I don't think they are exposed through the pcapy interface, but it's possible that you're not handling all packets; if you read them too slowly, you will not know that you missed something (although you can get the same information by tracking the length / position of the TCP stream). libpcap itself does expose those statistics, so you might be able to add a function for it
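The last point, tracking the position of the TCP stream, could look roughly like the sketch below. StreamTracker is a hypothetical helper, not part of dpkt, pypcap, or pcapy, and it assumes you know the sequence number of the first payload byte (for example, tcp.seq of the segment that starts with 'HTTP'). It stores each payload at its offset in the stream, counts retransmitted bytes as duplicates, and reports byte ranges that were never seen.

```python
class StreamTracker:
    """Collect TCP payload by sequence number and report gaps.

    Hypothetical helper for illustration only.
    """

    def __init__(self, first_seq):
        self.base = first_seq   # sequence number of the first payload byte
        self.segments = {}      # relative stream offset -> payload bytes
        self.dup_bytes = 0      # bytes seen again at an already known offset

    def add(self, seq, payload):
        """Record one segment; empty segments (pure ACKs) are ignored."""
        if not payload:
            return
        # TCP sequence numbers are 32 bit and wrap around.
        offset = (seq - self.base) % 2**32
        old = self.segments.get(offset)
        if old is not None and len(old) >= len(payload):
            self.dup_bytes += len(payload)   # retransmission, nothing new
            return
        self.segments[offset] = payload

    def gaps(self, expected_len=None):
        """Return (start, end) offsets of stream ranges never captured."""
        missing, pos = [], 0
        for offset in sorted(self.segments):
            if offset > pos:
                missing.append((pos, offset))
            pos = max(pos, offset + len(self.segments[offset]))
        if expected_len is not None and pos < expected_len:
            missing.append((pos, expected_len))
        return missing

    def data(self):
        """Return the contiguous payload from the start up to the first gap."""
        out = bytearray()
        for offset in sorted(self.segments):
            if offset > len(out):
                break                         # gap: stop here
            out += self.segments[offset][len(out) - offset:]
        return bytes(out)
```

In the handler from the question this would be called as tracker.add(tcp.seq, tcp.data) for every packet of the connection. Note that the stream contains the HTTP header as well, so expected_len would be the header length plus the Content-Length. If gaps() is non-empty for a live capture but empty for the same download read from a Wireshark file, packets were most likely dropped before they reached your code.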
0
votes

Set the snaplen to 65535. Apparently this is the default for Wireshark: http://www.wireshark.org/docs/wsug_html_chunked/ChCustCommandLine.html
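To find out whether the snaplen is actually the problem, you can compare the length the IP header claims with the number of bytes you actually received. The following is a minimal sketch under some assumptions: the function name is made up, the capture is plain Ethernet carrying IPv4, and there are no VLAN tags. It works on the raw packet bytes, so it does not depend on which pcap wrapper you use.

```python
import struct

ETH_HEADER_LEN = 14      # destination MAC + source MAC + EtherType
ETH_TYPE_IP = 0x0800


def ipv4_truncated_bytes(frame):
    """Return how many bytes of the IPv4 packet are missing from `frame`.

    Returns 0 if the packet was captured completely and None if the
    frame is not IPv4 or is too short to read the IP length field.
    """
    if len(frame) < ETH_HEADER_LEN + 4:
        return None
    (eth_type,) = struct.unpack("!H", frame[12:14])
    if eth_type != ETH_TYPE_IP:
        return None
    # Bytes 2-3 of the IP header hold the total length of the IP packet.
    (total_len,) = struct.unpack("!H", frame[16:18])
    captured = len(frame) - ETH_HEADER_LEN
    # Short frames may carry Ethernet padding, so captured can exceed total_len.
    return max(0, total_len - captured)
```

If this returns a positive number for the packets in your callback, they are being cut short and a larger snaplen should help. If it always returns 0, as the 1440-byte payloads in the question suggest, then whole packets are missing and the snaplen is probably not the cause.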