Python bitmask (variable length)

votes

in order to solve one research problem we have to organise the bitmask search in python. As an input we have a raw data (we present it as a sequence of bits). Size is smth about 1,5Gb. as an output we have to get the number of occurrence of a specific bitmasks. Let me give an example to discribe the situation

input:    sequence of bits, a bitmask to search(mask length: 12bits)

the first idea (not efficient one) is to use XOR like this:

1step: from input we take 12 first bits(position 0 to 11) and make XOR with mask 
2step: from input we take bits from 1 to 12 position and XOR with mask ...

let us proceed 2 first steps:

input sequence 100100011110101010110110011010100101010110101010
mask to search: 100100011110
step 1: take first 12 bits from input: 100100011110 and XOR it with mask.
step 2: teke bits from 1 to 12position: 001000111101 and XOR it with mask.
...

the problem is: how to organise taking bits from an input ? we are able to take first 12 bits, but how can we take bits from 1 to 12 position those we need to proceed next iteration?

Before we use a python BitString package, but the time we spend to search all mask are to high. And one more. The size of a mask can be fron 12 bits up to 256. have any suggestions? task have to be implemented in python

pythonbitmask

What format is your input? Is it a list of bits [1, 0, 0, 1, 0, 0, 0, ...] or string "1001000..."? Or is it a list of (specified bit-width) integers? Also, if performance is a concern, you can make a pretty solid argument against Python to whoever is forcing you to use it. Otherwise, you must sacrifice performance for the benefits in expression you get. - Zooba

about input format: the whole input (1.5Gb file) is divided into blocks length 4096bytes (32768bits). Imaging that we are able to read raw data straight from a hard drive in blocks length 4096bytes. . So it is just a sequence of 1s and 0s (e.g., 0b011011010.). But we are able to present it as a vector of 1s and 0s or any other ways (if it helps us to search for a mask faster=)) - asssag

If we talk about time and performance - only in the context that we can not wait 100 years until the end calculations=) (we need these calculations only to test our theory. But it will be very good if we use the algorithm to search a bit mask with minimal algorithmic complexity) - asssag

Thanks for that. See my answer below. - Zooba

3 Answers

votes

Your algorithm is the naive way to search for "strings" in data, but luckily there are much better algorithms. One example is the KMP algorithm, but there are others that might fit your use case better.

With a better algorithm you can go from complexity of O(n*m) to O(n+m).

votes

Where your mask is a multiple of 8 bits, your search becomes a relatively trivial byte comparison and any substring search algorithm will suffice (I would not recommend converting to a string and using the built in search, since you will likely suffer from failed character validation issues.)

sequence = <list of 8-bit integers>
mask = [0b10010001, 0b01101101]
matches = my_substring_search(sequence, mask)

For a mask of greater than 8 bits but not a multiple of eight, I would suggest truncating the mask to a multiple of 8 and using the same substring search as above. Then for any matches found, you can test the remainder.

sequence = <list of 8-bit integers>
mask_a = [0b10010001]
mask_b = 0b01100000
mask_b_pattern = 0b11110000   # relevant bits of mask_b
matches = my_substring_search(sequence, mask_a)

for match in matches:
    if (sequence[match+len(mask_a)] & mask_b_pattern) == mask_b:
        valid_match = True  # or something more useful...

If sequence is a list of 4096 bytes, you may need to account for the overlap between sections. This can be easily done by making sequence a list of 4096+ceil(mask_bits/8.0) bytes, but still advancing by only 4096 each time you read the next block.

As a demonstration of generating and using these masks:

class Mask(object):
    def __init__(self, source, source_mask):
        self._masks = list(self._generate_masks(source, source_mask))

    def match(self, buffer, i, j):
        return any(m.match(buffer, i, j) for m in self._masks)

    class MaskBits(object):
        def __init__(self, pre, pre_mask, match_bytes, post, post_mask):
            self.match_bytes = match_bytes
            self.pre, self.pre_mask = pre, pre_mask
            self.post, self.post_mask = post, post_mask

        def __repr__(self):
            return '(%02x %02x) (%s) (%02x %02x)' % (self.pre, self.pre_mask,
                ', '.join('%02x' % m for m in self.match_bytes),
                self.post, self.post_mask)

        def match(self, buffer, i, j):
            return (buffer[i:j] == self.match_bytes and
                    buffer[i-1] & self.pre_mask == self.pre and
                    buffer[j] & self.post_mask == self.post)

    def _generate_masks(self, src, src_mask):
        pre_mask = 0
        pre = 0
        post_mask = 0
        post = 0
        while pre_mask != 0xFF:
            src_bytes = []
            for i in (24, 16, 8, 0):
                if (src_mask >> i) & 0xFF == 0xFF:
                    src_bytes.append((src >> i) & 0xFF)
                else:
                    post_mask = (src_mask >> i) & 0xFF
                    post = (src >> i) & 0xFF
                    break
            yield self.MaskBits(pre, pre_mask, src_bytes, post, post_mask)
            pre += pre
            pre_mask += pre_mask
            if src & 0x80000000: pre |= 0x00000001
            pre_mask |= 0x00000001
            src = (src & 0x7FFFFFFF) * 2
            src_mask = (src_mask & 0x7FFFFFFF) * 2

This code is not a complete search algorithm, it forms part of validating matches. The Mask object is constructed with a source value and a source mask, both left aligned and (in this implementation) 32-bits long:

src = 0b11101011011011010101001010100000
src_mask = 0b11111111111111111111111111100000

The buffer is a list of byte values:

buffer_1 = [0x7e, 0xb6, 0xd5, 0x2b, 0x88]

A Mask object generates an internal list of shifted masks:

>> m = Mask(src, src_mask)
>> m._masks
[(00 00) (eb, 6d, 52) (a0 e0),
 (01 01) (d6, da, a5) (40 c0),
 (03 03) (ad, b5, 4a) (80 80),
 (07 07) (5b, 6a, 95) (00 00),
 (0e 0f) (b6, d5) (2a fe),
 (1d 1f) (6d, aa) (54 fc),
 (3a 3f) (db, 54) (a8 f8),
 (75 7f) (b6, a9) (50 f0)]

The middle element is the exact match substring (there is no neat way to get this out of this object as is, but it's m._masks[i].match_bytes). Once you have used an efficient algorithm to find this subsequence, you can verify the surrounding bits using m.match(buffer, i, j), where i is the index of first matching byte and j is the index of the byte after the last matching byte (such that buffer[i:j] == match_bytes).

In buffer above, the bit sequence can be found starting at bit 5, which means that _masks[4].match_bytes can be found at buffer[1:3]. As a result:

>> m.match(buffer, 1, 3)
True

(Feel free to use, adapt, modify, sell or torture this code in any way possible. I quite enjoyed putting it together - an interesting problem - though I won't be held liable for any bugs, so make sure you test it thoroughly!)

votes

Searching for bit patterns within byte data is a little more challenging than typical searches. The usual algorithms don't always work well as there is a cost for extracting each bit from the byte data and there is only an 'alphabet' of two characters, so just by chance 50% of comparisons will match (this makes many algorithms much less efficient).

You mentioned trying the bitstring module (which I wrote) but that it was too slow. That's not too surprising to me so if anyone has any great ideas on how to do this I'm paying attention! But the way bitstring does it suggests a possible speed up for you:

To do the match bitstring converts chunks of the byte data into ordinary strings of '0's and '1', and then uses Python's find method to do a quick search. Lots of the time is spent converting the data to a string, but as you are searching in the same data multiple times there's a big saving to be had.

masks = ['0000101010100101', '010100011110110101101', '01010101101']
byte_data_chunk = bytearray('blahblahblah')
# convert to a string with one character per bit
# uses lookup table not given here!
s = ''.join(BYTE_TO_BITS[x] for x in byte_data_chunk)
for mask in masks:
    p = s.find(mask)
    # etc.

The point is that once you convert to an ordinary string you can use the built-in find, which is very well optimised, to search for each of your masks. When you used bitstring it had to do the conversion for every mask which would have killed performance.