2
votes

I'm working on the ISCXVPN2016 dataset. It consists of pcap files (each pcap is the captured traffic of a specific app such as Skype, YouTube, etc.), which I have converted to pickle files and then written into a text file using the code below:

with open("AIMchat2.pcapng.pickle", "rb") as pkl, open("file.txt", "w") as f:
    for item in pkl:  # iterating a binary file yields b'\n'-separated chunks, not records
        f.write('%s\n' % item)

file.txt:

b'\x80\x03]q\x00(cnumpy.core.multiarray\n' b'_reconstruct\n' b'q\x01cnumpy\n' b'ndarray\n' b'q\x02K\x00\x85q\x03C\x01bq\x04\x87q\x05Rq\x06(K\x01K\x9d\x85q\x07cnumpy\n' b'dtype\n' b'q\x08X\x02\x00\x00\x00u1q\tK\x00K\x01\x87q\n' b'Rq\x0b(K\x03X\x01\x00\x00\x00|q\x0cNNNJ\xff\xff\xff\xffJ\xff\xff\xff\xffK\x00tq\rb\x89C\x9dE\x00\x00\x9dU\xbc@\x00\x80\x06\xd7\xc9\x83\xca\xf0W@\x0c\x18\xa74I\x01\xbb\t].\xc8\xf3*\xc51P\x18\xfa[)j\x00\x00\x17\x03\x02\x00p\x14\x90\xccY|\xa3\x7f\xd1\x12\xe2\xb4.U9)\xf20\xf1{\xbd\x1d\xa3W\x0c\x19\xc2\xf0\x8c\x0b\x8c\x86\x16\x99\xd8:\x19\xb0G\xe7\xb2\xf4\x9d\x82\x8e&a\x04\xf2\xa2\x8e\xce\xa4b\xcc\xfb\xe4\xd0\xde\x89eUU]\x1e\xfeF\x9bv\x88\xf4\xf3\xdc\x8f\xde\xa6Kk1q`\x94]\x13\xd7|\xa3\x16\xce\xcc\x1b\xa7\x10\xc5\xbd\x00\xe8M\x8b\x05v\x95\xa3\x8c\xd0\x83\xc1\xf1\x12\xee\x9f\xefmq\x0etq\x0fbh\x01h\x02K\x00\x85q\x10h\x04\x87q\x11Rq\x12(K\x01K.\x85q\x13h\x0b\x89C.E\x00\x00

My question is: how can I compute the entropy of each pickle file?

(I have updated the question)

please define entropy – Marat
If you need a rigorous process and a determined value, please comment. – noobmaster69
@Marat Entropy is a measure of the randomness of data. If you mean which kind of entropy: there are several options, but for now I can simply use Shannon entropy. – Nebula
@ventaquil Actually I saw that, but I couldn't write the Python code; I'm kind of new to Python. – Nebula

3 Answers

2
votes

If I've done nothing wrong, this is the answer (based on How to calculate the entropy of a file? and Shannon entropy). The idea is to treat the file as a stream of bytes and compute H = -sum(p(b) * log2(p(b))) over all 256 possible byte values b, where p(b) is the relative frequency of byte b in the file:

#!/usr/bin/env python3

import math


filename = "random_data.bin"

with open(filename, "rb") as file:
    counters = {byte: 0 for byte in range(2 ** 8)}  # start all counters with zeros

    for byte in file.read():  # read the whole file into memory (chunk this for very large files)
        counters[byte] += 1  # increment the counter for this byte value

    filesize = file.tell()  # we can get file size by reading current position

    probabilities = [counter / filesize for counter in counters.values()]  # calculate probabilities for each byte

    entropy = -sum(probability * math.log2(probability) for probability in probabilities if probability > 0)  # final sum

    print(entropy)

Checked with the ent program on Ubuntu 18.04 with Python 3.6.9:

$ dd if=/dev/urandom of=random_data.bin bs=1K count=16
16+0 records in
16+0 records out
16384 bytes (16 kB, 16 KiB) copied, 0.0012111 s, 13.5 MB/s
$ ent random_data.bin
Entropy = 7.988752 bits per byte.
...
$ ./calc_entropy.py
7.988751920202076

Tested with a text file too.

$ ent calc_entropy.py
Entropy = 4.613356 bits per byte.
...
$ ./calc_entropy.py
4.613355601248316
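
To apply this to the asker's dataset, here is a minimal sketch that wraps the same computation in a function and runs it over every pickle file in a directory (the directory "." and the "*.pickle" pattern are assumptions; adjust them to wherever the converted files live):

#!/usr/bin/env python3

import math
from pathlib import Path


def byte_entropy(path):
    counters = {byte: 0 for byte in range(2 ** 8)}  # one counter per possible byte value
    with open(path, "rb") as file:
        for byte in file.read():
            counters[byte] += 1
        filesize = file.tell()
    probabilities = [counter / filesize for counter in counters.values()]
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


for path in sorted(Path(".").glob("*.pickle")):  # assumed location/pattern of the pickles
    print(path, byte_entropy(path))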
2
votes

A naive solution would be to compress the file (e.g., with gzip; tar on its own only archives, it doesn't compress) and use (size of compressed file) / (size of original file) as a measure of randomness.
This result isn't accurate, since gzip is not an "ideal" compressor, but it becomes more accurate as the file size grows.
A ready-made Python recipe for computing Shannon entropy is here:
http://code.activestate.com/recipes/577476-shannon-entropy-calculation/#c3
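
For a quick check of the compression-ratio idea, here is a minimal sketch using only the standard library (the filename is an assumption); multiplying the ratio by 8 gives a rough estimate in bits per byte:

#!/usr/bin/env python3

import gzip
from pathlib import Path

filename = "AIMchat2.pcapng.pickle"  # assumption: point this at one of your files

raw = Path(filename).read_bytes()
compressed = gzip.compress(raw)  # includes a small fixed gzip header/overhead

ratio = len(compressed) / len(raw)
print("compression ratio:", ratio)
print("rough entropy estimate:", 8 * min(ratio, 1.0), "bits per byte")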

1
votes

You could use BiEntropy, TriEntropy, or their addition TriBiEntropy to compute the entropy of your pickle files. The algorithms are described on www.arxiv.org, and BiEntropy has been implemented with test harnesses on GitHub. BiEntropy has been tested positively on large raw binary files.
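
For illustration, here is a minimal sketch of BiEntropy as I understand the published definition (a weighted average of the Shannon entropies of a bit string and its successive binary derivatives, with weight 2**k on the k-th derivative); treat it as an assumption and prefer the tested GitHub implementation for real work. The exponential weights make this form practical only for short strings:

#!/usr/bin/env python3

import math


def binary_shannon(p):
    # Shannon entropy of a biased coin, with 0 * log2(0) taken as 0
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def bientropy(bits):
    # bits: list of 0/1 ints, length >= 2
    n = len(bits)
    total = 0.0
    norm = 0.0
    d = bits
    for k in range(n - 1):  # the string itself, then its binary derivatives
        p = sum(d) / len(d)  # proportion of 1s in the k-th derivative
        total += binary_shannon(p) * 2 ** k
        norm += 2 ** k
        d = [a ^ b for a, b in zip(d, d[1:])]  # next binary derivative (XOR of neighbours)
    return total / norm


print(bientropy([0, 1, 0, 1, 0, 1, 0, 1]))  # perfectly periodic string -> low BiEntropy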