1 vote

I'm trying to implement the n-mode tensor-matrix product (as defined by Kolda and Bader: https://www.sandia.gov/~tgkolda/pubs/pubfiles/SAND2007-6702.pdf) efficiently in Python using Numpy. The operation effectively comes down to (for matrix U, tensor X and axis/mode k):

  1. Extract all vectors along axis k from X by collapsing all other axes.

  2. Multiply these vectors on the left by U using standard matrix multiplication.

  3. Insert the vectors again into the output tensor using the same shape, apart from X.shape[k], which is now equal to U.shape[0] (initially, X.shape[k] must be equal to U.shape[1], as a result of the matrix multiplication).
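For reference, the three steps above can also be sketched in one line with np.tensordot, which contracts U's columns against X's k-th axis and then moves the resulting axis (which tensordot places first) back into position k. This is just an illustrative sketch of the operation being defined, not the benchmarked implementation:

```python
import numpy as np

def mode_k_product_tensordot(U, X, mode):
    # Contract U's axis 1 with X's axis `mode`; tensordot puts the
    # new axis (of length U.shape[0]) first, so move it back to `mode`.
    return np.moveaxis(np.tensordot(U, X, axes=(1, mode)), 0, mode)

X = np.random.rand(3, 4, 5)
U = np.random.rand(6, 4)
Y = mode_k_product_tensordot(U, X, 1)
print(Y.shape)  # (3, 6, 5)
```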

I've been using an explicit implementation for a while which performs all these steps separately:

  1. Transpose the tensor to bring axis k to the front. (In my full code I added an exception for the case k == X.ndim - 1, where it's faster to leave the axis where it is and transpose all subsequent operations instead, at least in my application, but that's not relevant here.)

  2. Reshape the tensor to collapse all other axes.

  3. Calculate the matrix multiplication.

  4. Reshape the tensor to reconstruct all other axes.

  5. Transpose the tensor back into the original order.

I would think this implementation creates a lot of unnecessary (big) arrays, so once I discovered np.einsum I expected it to speed things up considerably. However, using the code below I got worse results:

import numpy as np
from time import time

def mode_k_product(U, X, mode):
    transposition_order = list(range(X.ndim))
    transposition_order[mode] = 0
    transposition_order[0] = mode
    Y = np.transpose(X, transposition_order)
    transposed_ranks = list(Y.shape)
    Y = np.reshape(Y, (Y.shape[0], -1))
    Y = U @ Y
    transposed_ranks[0] = Y.shape[0]
    Y = np.reshape(Y, transposed_ranks)
    Y = np.transpose(Y, transposition_order)
    return Y

def einsum_product(U, X, mode):
    axes1 = list(range(X.ndim))
    axes1[mode] = X.ndim + 1
    axes2 = list(range(X.ndim))
    axes2[mode] = X.ndim
    return np.einsum(U, [X.ndim, X.ndim + 1], X, axes1, axes2, optimize=True)

def test_correctness():
    A = np.random.rand(3, 4, 5)
    for i in range(3):
        B = np.random.rand(6, A.shape[i])
        X = mode_k_product(B, A, i)
        Y = einsum_product(B, A, i)
        print(np.allclose(X, Y))

def test_time(method, amount):
    U = np.random.rand(256, 512)
    X = np.random.rand(512, 512, 256)
    start = time()
    for i in range(amount):
        method(U, X, 1)
    return (time() - start)/amount

def test_times():
    print("Explicit:", test_time(mode_k_product, 10))
    print("Einsum:", test_time(einsum_product, 10))

test_correctness()
test_times()

Timings for me:

Explicit: 3.9450525522232054

Einsum: 15.873924326896667

Is this normal, or am I doing something wrong? I know there are circumstances where storing intermediate results can reduce complexity (e.g. chained matrix multiplication), but in this case I can't think of any calculations that are being repeated. Is matrix multiplication so optimized that it removes the benefit of skipping the transposes (which technically lowers the complexity)?

Following are the timings on my system (3.6.5, NumPy 1.14.3): Explicit: 1.1669329166412354 and Einsum: 1.553536319732666 - Sheldore
My timings are: Explicit: 0.781634783744812 Einsum: 0.8676517248153687 on Python 3.5 and NumPy 1.16.1. What NumPy version are you using? - Rohola Zandie
In k_product the heavy lifting is in the @ product. The transpose and reshaping is virtually costless. Depending on the shapes einsum optimize might end up using matmul as well. So it's not surprising that the timings will be similar. - hpaulj
Thanks for your comments, it looks like my software may be a bit outdated. I'm using Python 3.6.7 with Numpy 1.13.3. I'll see if I can update my Numpy installation. My computer is still drastically slower, though; could this have something to do with hardware? I'm using an Intel(R) Core(TM) i7-4700HQ CPU @ 2.40GHz and both methods seem to use only one core. - Wout12345
The use of the optimize parameter in einsum has shifted quite a bit in recent numpy versions. So yes, it would be good to get a recent version, and test a couple of alternatives. - hpaulj
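hpaulj's point about the optimizer possibly dispatching to matmul can be checked directly: np.einsum_path (available since NumPy 1.12) reports the contraction plan that einsum's optimize machinery would use, along with its estimated speedup, without running the contraction. A quick sketch using the shapes from the question:

```python
import numpy as np

# Inspect what einsum's optimizer plans for the mode-1 product,
# without actually performing the contraction.
U = np.random.rand(256, 512)
X = np.random.rand(512, 512, 256)
path, report = np.einsum_path('ij,kjl->kil', U, X, optimize='optimal')
print(path)    # e.g. ['einsum_path', (0, 1)]
print(report)  # human-readable summary of the planned contraction
```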

2 Answers

1 vote

I'm more familiar with the subscripts style of using einsum, so I worked out these equivalences:

In [194]: np.allclose(np.einsum('ij,jkl->ikl',B0,A), einsum_product(B0,A,0))          
Out[194]: True
In [195]: np.allclose(np.einsum('ij,kjl->kil',B1,A), einsum_product(B1,A,1))          
Out[195]: True
In [196]: np.allclose(np.einsum('ij,klj->kli',B2,A), einsum_product(B2,A,2))          
Out[196]: True

With a mode parameter, your approach in einsum_product may be best. But the equivalences help me visualize the calculation better, and may help others.

Timings should be basically the same. There's extra setup time in einsum_product that should become negligible at larger dimensions.
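The three hand-written equivalences above can also be generated for an arbitrary mode. Here is a hypothetical helper (not from the original post) that builds the subscripts string programmatically, so the subscripts style still works with a mode parameter:

```python
import numpy as np

# Build an einsum subscripts string for the mode-k product:
# label X's axes 'a', 'b', 'c', ..., label U's output axis 'z',
# and contract U's second axis with X's k-th axis.
def subscripts_for_mode(ndim, mode):
    in_labels = [chr(ord('a') + i) for i in range(ndim)]  # X's axis labels
    out_labels = list(in_labels)
    out_labels[mode] = 'z'                                # replaced by U's output axis
    return 'z' + in_labels[mode] + ',' + ''.join(in_labels) + '->' + ''.join(out_labels)

A = np.random.rand(3, 4, 5)
B = np.random.rand(6, 4)
sub = subscripts_for_mode(A.ndim, 1)
print(sub)  # zb,abc->azc
Y = np.einsum(sub, B, A, optimize=True)
print(Y.shape)  # (3, 6, 5)
```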

0 votes

After updating Numpy, Einsum is only slightly slower than the explicit method, with or without multi-threading (see comments to my question).