3
votes

In MATLAB I want to multiply an Nx4 matrix by a 4xN matrix and get an Nx1 vector out of it. I'm also dividing the result element-wise by another vector.

In a loop, it would be:

A=rand(10,4);
B=rand(4,10);
L=rand(10,1);
for i=1:10
    result(i)=A(i,:)*B(:,i)/L(i);
end

The only non-loop method I can think of to get this done is:

A=rand(10,4);
B=rand(4,10);
L=rand(10,1);
result=diag(A*B)./L

But that does a lot of unnecessary multiplication. Is there a better way?


1 Answer

7
votes

Vectorized Approach

You can element-wise multiply A with the transpose of B, then sum along dim-2, and finally perform element-wise division with L -

result = sum(A.*B.',2)./L
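To see why this is equivalent: `A(i,:)*B(:,i)` is the dot product of row i of A with column i of B, which is exactly what summing `A.*B.'` along the second dimension computes for every row at once. A quick sanity check (with arbitrary small sizes, picked just for illustration) could look like:

    % Compare the loopy reference against the vectorized formula
    A = rand(5,4);  B = rand(4,5);  L = rand(5,1);
    ref = zeros(5,1);
    for i = 1:5
        ref(i) = A(i,:)*B(:,i)/L(i);   % per-row dot product, then divide
    end
    vect = sum(A.*B.',2)./L;           % vectorized version
    max(abs(ref - vect))               % expect a value near eps

Any difference should be on the order of floating-point round-off only.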

Benchmarking

This section covers runtime and speedup tests for the proposed approach against the loop-based approach listed in the first part of the question. Note that the diag-based approach suggested in the second part of the question was left out of these tests, as its runtimes were comparatively much larger.

Benchmarking Code

N_arr = 4000:4000:100000;        %// Datasizes
timeall = zeros(2,numel(N_arr)); %// Array to store runtimes
for iter = 1:numel(N_arr)

    %// Create random inputs
    N = N_arr(iter);
    A=rand(N,4);    B=rand(4,N);    L=rand(N,1);

    %// Time the approaches
    f = @() mult_twomat_loopy(A,B,L);
    timeall(1,iter) = timeit(f); clear f
    f = @() mult_twomat_vect(A,B,L);
    timeall(2,iter) = timeit(f); clear f
end

%// Plot speedups
figure,hold on,grid on
plot(N_arr,timeall(1,:)./timeall(2,:),'-bo')
xlabel('Datasize, N(Rows in A)'),ylabel('Speedup Factor (x)')
title('Speedup Plot')

Associated function codes

mult_twomat_loopy.m:

function result = mult_twomat_loopy(A,B,L)
N = size(A,1);
result = zeros(N,1);
for i=1:N
    result(i)=A(i,:)*B(:,i)/L(i);
end

mult_twomat_vect.m:

function result = mult_twomat_vect(A,B,L)
result = sum(A.*B.',2)./L;

Speedups for the proposed approach over loopy one

(Speedup plot: speedup factor of the vectorized approach over the loopy one, versus datasize N.)

Conclusions

As can be seen from the speedup plot, the proposed approach looks like a very good bet for solving this problem. One interesting observation from the plot is the sudden dip in performance of the proposed approach against the loopy one at datasizes beyond roughly 32K rows. The most likely reason for this dip is system memory bandwidth pegging back performance at such large datasizes, but thankfully even there the speedup stays comfortably above the 20x mark.