Fast way to md5 a numpy array

Question

I am working with a 1-D NumPy array of thousands of uint64 numbers in Python 2.7. What is the fastest way to calculate the MD5 of each number individually?

Each number has to be converted to a string before calling the MD5 function. I have read in many places that iterating over NumPy arrays and doing the work in pure Python is dead slow. Is there any way to avoid that?

what's the point of this conversion? how md5 string can be used, that the original float64 can not? — lenik
I just want to convert the uint64 to strings and then get their MD5 as fast as possible. Gonna use those md5 strings later on. — Frederico Schardong
I'm pretty sure that @lenik is right and that you don't need this conversion. Converting before applying the MD5 seems to be an attempt of optimizing a code that is not even yet functional. Would you have a try applying lenik's suggestion? — Tim

Nils Werner · Accepted Answer · 2019-12-05T09:46:25

You can write a wrapper for OpenSSL's MD5() function that accepts NumPy arrays. Our baseline will be a pure Python implementation.

Create a wrapper using cffi:

import cffi

ffi = cffi.FFI()

header = r"""
void md5_array(uint64_t* buffer, int len, unsigned char* out);
"""

source = r"""
#include <stdint.h>
#include <openssl/md5.h>

void md5_array(uint64_t * buffer, int len, unsigned char * out) {
    int i = 0;
    /* Hash each 8-byte element separately; every digest takes 16 bytes of out. */
    for(i=0; i<len; i++) {
        MD5((const unsigned char *) &buffer[i], 8, out + i*16);
    }
}
"""

ffi.set_source("_md5", source, libraries=['crypto'])  # MD5() is provided by OpenSSL's libcrypto
ffi.cdef(header)

if __name__ == "__main__":
    ffi.compile()

and call it from a NumPy-aware Python function:

import numpy as np
import _md5

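def md5_array(data):
    # The C loop reads 8 bytes per element, so require contiguous uint64 input.
    data = np.ascontiguousarray(data, dtype=np.uint64)
    # One raw 16-byte digest per input element.
    out = np.zeros(data.shape, dtype='|S16')

    _md5.lib.md5_array(
        _md5.ffi.from_buffer(data),
        data.size,
        _md5.ffi.cast("unsigned char *", _md5.ffi.from_buffer(out))
    )
    return out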
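
One caveat that may be worth checking: NumPy's fixed-width bytes dtype treats trailing NUL bytes as padding, so reading a single element of an `S16` array drops them, and a digest whose last byte is `\x00` would come back shorter than 16 bytes. The sketch below uses made-up 16-byte values (not real digests) to show the effect, along with one possible workaround: viewing the same buffer as rows of 16 `uint8` values. All variable names are illustrative only.

```python
import numpy as np

# Two made-up 16-byte values standing in for digests; the second ends in NUL.
raw = [b"\x01" * 16, b"\x02" * 15 + b"\x00"]
out = np.array(raw, dtype="|S16")

# Element access strips trailing NUL bytes, so the second item looks shorter.
lengths = [len(item) for item in out]
print(lengths)  # [16, 15]

# The underlying buffer still holds all 32 bytes; view it as rows of 16 bytes.
as_bytes = out.view(np.uint8).reshape(-1, 16)
restored = [row.tobytes() for row in as_bytes]
print(restored == raw)  # True
```

Whether this matters depends on how the digests are consumed later; `out.tobytes()` also returns the full, unstripped buffer.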

and compare the two:

import numpy as np
import hashlib

data = np.arange(16, dtype=np.uint64)
out = [hashlib.md5(i).digest() for i in data]

print(data)
# [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]
print(out)
# [b'}\xea6+?\xac\x8e\x00\x95jIR\xa3\xd4\xf4t', ... , b'w)\r\xf2^\x84\x11w\xbb\xa1\x94\xc1\x8c8XS']

out = md5_array(data)

print(out)
# [b'}\xea6+?\xac\x8e\x00\x95jIR\xa3\xd4\xf4t', ... , b'w)\r\xf2^\x84\x11w\xbb\xa1\x94\xc1\x8c8XS']
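
Note that both versions above hash the 8 raw bytes of each `uint64`, not its decimal string as the question literally describes, and the two give different digests. A minimal stdlib-only sketch of the difference, assuming little-endian byte order (as on x86) and using `struct.pack` to mimic the in-memory layout:

```python
import hashlib
import struct

n = 0

# Digest of the 8 raw little-endian bytes (what the NumPy-based code hashes).
raw_digest = hashlib.md5(struct.pack("<Q", n)).hexdigest()

# Digest of the decimal string representation (what the question describes).
str_digest = hashlib.md5(str(n).encode("ascii")).hexdigest()

print(raw_digest)  # 7dea362b3fac8e00956a4952a3d4f474
print(raw_digest == str_digest)  # False
```

If the digests must match ones computed elsewhere from strings, the string form has to be hashed instead, which the C wrapper shown here does not do.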

For large arrays it's about 14x faster (I am a bit disappointed by that, honestly...)

data = np.arange(100000, dtype=np.uint64)

%timeit [hashlib.md5(i).digest() for i in data]
169 ms ± 3.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit md5_array(data)
12.1 ms ± 144 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
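
If compiling a C extension is not an option, part of the baseline's cost may come from iterating over NumPy scalars. One hedged pure-Python variant, not part of the measurements above and not benchmarked here, slices the array's raw bytes instead. `md5_array_py` is a hypothetical helper name, and any speedup should be measured on your own data.

```python
import hashlib
import numpy as np

def md5_array_py(data):
    """Return one raw MD5 digest per uint64 element (pure Python loop)."""
    # Copy the array into one contiguous bytes object, 8 bytes per element.
    raw = np.ascontiguousarray(data, dtype=np.uint64).tobytes()
    md5 = hashlib.md5
    # Hash each 8-byte slice; slicing bytes avoids creating NumPy scalars.
    return [md5(raw[i:i + 8]).digest() for i in range(0, len(raw), 8)]

data = np.arange(16, dtype=np.uint64)
digests = md5_array_py(data)
# Same digests as hashing each element's bytes directly.
print(digests == [hashlib.md5(i.tobytes()).digest() for i in data])  # True
```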
