cython memoryview slower than expected

Question

I've started using memoryviews in Cython to access numpy arrays. One of their advertised advantages is that they are considerably faster than the old numpy buffer support: http://docs.cython.org/src/userguide/memoryviews.html#comparison-to-the-old-buffer-support

However, I have an example where the old numpy buffer support is faster than memoryviews. How can this be? Am I using memoryviews correctly?

This is my test:

import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef np.ndarray[np.uint8_t, ndim=2] image_box1(np.ndarray[np.uint8_t, ndim=2] im, 
                                               np.ndarray[np.float64_t, ndim=1] pd,  
                                               int box_half_size):
    cdef unsigned int p0 = <int>(pd[0] + 0.5)  
    cdef unsigned int p1 = <int>(pd[1] + 0.5)    
    cdef unsigned int top = p1 - box_half_size
    cdef unsigned int left = p0 - box_half_size
    cdef unsigned int bottom = p1 + box_half_size
    cdef unsigned int right = p0 + box_half_size    
    cdef np.ndarray[np.uint8_t, ndim=2] box = im[top:bottom, left:right] 
    return box 

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef np.uint8_t[:, ::1] image_box2(np.uint8_t[:, ::1] im, 
                                    np.float64_t[:] pd,  
                                    int box_half_size):

    cdef unsigned int p0 = <int>(pd[0] + 0.5)  
    cdef unsigned int p1 = <int>(pd[1] + 0.5)    
    cdef unsigned int top = p1 - box_half_size
    cdef unsigned int left = p0 - box_half_size
    cdef unsigned int bottom = p1 + box_half_size
    cdef unsigned int right = p0 + box_half_size     
    cdef np.uint8_t[:, ::1] box = im[top:bottom, left:right]   
    return box

The timing results are:

image_box1: typed numpy: 100000 loops, best of 3: 11.2 us per loop

image_box2: memoryview: 100000 loops, best of 3: 18.1 us per loop

These measurements are done from IPython using %timeit image_box1(im, pd, box_half_size)
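For reference, the box arithmetic shared by both functions can be sketched in plain Python. The helper name box_bounds is hypothetical and not part of the module above; the sketch only mirrors the rounding (`<int>(x + 0.5)` truncates toward zero, like Python's int()) and the four bounds. One difference worth keeping in mind: the Cython versions store the bounds as unsigned int, so a point closer to the image edge than box_half_size would wrap around to a large value there, whereas this sketch returns a negative number.

```python
def box_bounds(pd, box_half_size):
    """Hypothetical helper: bounds of the box centred on the point pd = (x, y)."""
    # Round the non-negative coordinates to the nearest integer,
    # the same way <int>(x + 0.5) does in the Cython code.
    p0 = int(pd[0] + 0.5)
    p1 = int(pd[1] + 0.5)
    top = p1 - box_half_size
    left = p0 - box_half_size
    bottom = p1 + box_half_size
    right = p0 + box_half_size
    return top, bottom, left, right


print(box_bounds([10.4, 20.6], 3))  # (18, 24, 7, 13)
```

The resulting slice im[top:bottom, left:right] therefore has 2 * box_half_size rows and columns.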

I guess you are timing these functions from Python? In that case the return value of the second function is (I assume) also an np.ndarray, which may already explain the slowdown: creating the np.ndarray is a bit of extra work, and not much else is done here overall. — seberg
Yes, I timed these from IPython with the command: %timeit image_box1(im, pd, box_half_size). I have just edited my question to include timing from within Cython. Memoryviews are still slower! — martinako
Correction! You are right, the delay is in the conversion from numpy array to memoryview! — martinako

martinako · Accepted Answer · 2012-10-09T15:22:05

Alright! I found the problem. As seberg pointed out, the memoryviews appeared slower because the measurement included the automatic conversion from numpy array to memoryview.

I used the following function to measure the times from within the cython module:

def test(params):
    import timeit
    im = params[0]
    pd = params[1]
    box_half_size = params[2]
    # Typed numpy version: called directly with the numpy arrays.
    t1 = timeit.Timer(lambda: image_box1(im, pd, box_half_size))
    print('image_box1: typed numpy:')
    print(min(t1.repeat(3, 10)))
    # Convert to memoryviews once, outside the timed call.
    cdef np.uint8_t[:, ::1] im2 = im
    cdef np.float64_t[:] pd2 = pd
    t2 = timeit.Timer(lambda: image_box2(im2, pd2, box_half_size))
    print('image_box2: memoryview:')
    print(min(t2.repeat(3, 10)))

Result (total seconds for 10 calls, best of 3 runs):

image_box1: typed numpy: 9.07607864065e-05

image_box2: memoryview: 5.81799904467e-05

So memoryviews are indeed faster!

Note that I converted im and pd to memoryviews before calling image_box2. If I don't do this step and I pass im and pd directly, then image_box2 is slower:

image_box1: typed numpy: 9.12262257771e-05

image_box2: memoryview: 0.000185245087778
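The same measurement pattern can be sketched in pure Python. This is only an analogy under stated assumptions: it uses the built-in memoryview over a bytearray rather than a Cython typed memoryview over a numpy array, and the names best_time and row_sum are made up for the illustration. The built-in memoryview is cheap to create, so the two numbers will be much closer than in the Cython case; the point is the structure of the benchmark, where one variant pays for the conversion inside the timed call and the other hoists it out.

```python
import timeit


def best_time(func, repeat=3, number=10):
    """Fastest total time (seconds) for `number` calls, over `repeat` runs."""
    return min(timeit.Timer(func).repeat(repeat, number))


def row_sum(view, row, width):
    """Stand-in for the real work: sum one row of a flat, row-major buffer."""
    start = row * width
    return sum(view[start:start + width])


WIDTH = 64
buf = bytearray(range(256)) * 16  # 4096 bytes, i.e. a 64 x 64 "image"

# Variant 1: the conversion happens inside the timed call.
t_per_call = best_time(lambda: row_sum(memoryview(buf), 3, WIDTH))

# Variant 2: convert once up front, then time only the call itself.
view = memoryview(buf)
t_hoisted = best_time(lambda: row_sum(view, 3, WIDTH))

print('conversion per call:', t_per_call)
print('conversion hoisted: ', t_hoisted)
```

In real code the equivalent of variant 2 is to convert the arrays to memoryviews once and keep passing the memoryviews around, instead of handing numpy arrays to a memoryview-typed function on every call.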
