Proposed solution
I finally managed to crack this with a truly vectorized solution that uses a logical mask to select the elements of the input matrix that are to be summed up. The magic is done by bsxfun with the function handle @mod. The code is listed below -
[m,n] = size(M);
mask = bsxfun(@mod,1:n,(1:m)')==1; %//'# all of the magic happens here as it creates
%// a logical mask with 1's at the places in the input matrix
%// whose elements are to be summed and 0's elsewhere
mask(1,:) = 1; %// set the first row to all ones, as every element of row 1 is to be summed
sumvals = sum(mask.*M,2); %// finally get the sum values
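To see what the mask looks like, here's a quick illustration on a small 4 x 6 input (sizes picked just for illustration) -
m = 4; n = 6;
mask = bsxfun(@mod,1:n,(1:m)')==1; %//'
mask(1,:) = 1
%// mask =
%//      1     1     1     1     1     1     %// row 1: all columns
%//      1     0     1     0     1     0     %// row 2: columns 1:2:n
%//      1     0     0     1     0     0     %// row 3: columns 1:3:n
%//      1     0     0     0     1     0     %// row 4: columns 1:4:n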
Benchmarking
In this benchmarking section I am including four approaches - the one listed earlier in this post and its GPU-ported version, plus the arrayfun and sparse based approaches listed in the other solution.
Three sets of input data were used for the benchmarking (example sizes are sketched right after this list) -
Set 1: The number of columns is kept at 10x the number of rows of the input matrix, as used in the question.
Set 2: The number of rows is extended so that it is now 10x the number of columns. This really tests the loopy codes, as the
number of iterations is larger in this case.
Set 3: An extension of Set 2 that increases the number of rows further (to 40x the number of columns), and is thus another great test of truly vectorized approaches
against the others.
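To give a feel for the shapes involved, here are sample sizes for one base size, taken from the size settings in the benchmarking code at the end of this post -
base = 100;                      %// one entry of base_datasize_array
set1_size = [base*2  base*2*10]  %// 200 x 2000 : columns = 10x rows
set2_size = [base*2*10  base*2]  %// 2000 x 200 : rows = 10x columns
set3_size = [base*40  base]      %// 4000 x 100 : rows = 40x columns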
The functions used for benchmarking are listed next -
function sumvals = sumrows_stepped_bsxfun(M)
%// ... same as the code posted earlier
return
function sumvals = sumrows_stepped_bsxfun_gpu(M)
gM = gpuArray(M);
[m,n] = size(gM);
mask = bsxfun(@mod,gpuArray.colon(1,n),gpuArray.colon(1,m)')==1; %//'
sumvals = gather(sum(mask.*gM,2));
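%// row 1 of the mask is all zeros (mod of anything with 1 is 0), so fix its sum separately on the CPU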
sumvals(1) = sum(M(1,:));
return
function S = sumrows_stepped_arrayfun(M)
[m,n] = size(M);
S = arrayfun(@(x) sum(M(x,1:x:n)), 1:m);
return
function B = sumrows_stepped_sparse(M)
sz = size(M);
A = sparse(sz(1),sz(2));
for n = 1:sz(1)
    A(n,1:n:end) = 1; %// mark columns 1:n:end of row n for summing
end
B = full(sum(M.*A,2));
return
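As a quick sanity check (not part of the benchmark itself, and assuming the four functions above are saved on the path), the outputs could be compared on a small random input -
M = rand(5,20);
out_cpu = sumrows_stepped_bsxfun(M);
out_gpu = sumrows_stepped_bsxfun_gpu(M);
out_arr = sumrows_stepped_arrayfun(M);
out_sp  = sumrows_stepped_sparse(M);
max(abs(out_cpu - out_gpu))    %// expected to be zero or within floating point noise
max(abs(out_cpu - out_arr(:))) %// arrayfun version returns a row vector, hence (:)
max(abs(out_cpu - out_sp))     %// expected to be zero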
Please note that timeit was used for timing CPU based codes and gputimeit for GPU based codes.
System Configuration used for testing -
MATLAB Version: 8.3.0.532 (R2014a)
Operating System: Windows 7
RAM: 3GB
CPU Model: Intel® Pentium® Processor E5400 (2M Cache, 2.70 GHz)
GPU Model: GTX 750Ti 2GB
The benchmarking results thus obtained -
[Benchmark plots: runtimes of the four approaches for Set 1, Set 2 and Set 3]
Conclusions
For data sizes where the number of rows is smaller than the number of columns, the number of iterations is small and the loopy codes seem to have the upper hand.
As we increase the number of rows, the benefit of truly vectorized approaches becomes clear. You would also notice that for Set 3 the CPU based bsxfun approach does well against the non-vectorized approaches only until around the 12000 x 300 mark. The reason is that bsxfun creates a huge logical mask, and at that point the memory bandwidth requirement becomes too high for the compute capability of bsxfun to keep up with. This makes sense, as vectorized operations by definition perform operations on many elements in one go, so memory bandwidth is essential. So, if you have a better machine with more RAM, that 12000 x 300 mark should extend further.
If you could extend the number of rows even further, the benefits of a vectorized solution would become clearer, as long as the memory bandwidth is kept in check.
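To put a rough number on the size of those temporaries (a back-of-the-envelope estimate on my part, not a measurement), the main arrays involved at the 12000 x 300 mark add up to roughly -
m = 12000; n = 300;
mask_bytes    = m*n;     %// logical mask: 1 byte per element, ~3.6 MB
product_bytes = m*n*8;   %// the mask.*M intermediate (doubles), ~28.8 MB
input_bytes   = m*n*8;   %// the input matrix itself, ~28.8 MB
total_MB = (mask_bytes + product_bytes + input_bytes)/1e6 %// ~61 MB worked through per call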
Benchmarking Code
Finally, here's the benchmarking code if anyone would like to test it out on their systems -
clear all; clc; close all
outputfile = 'results.xlsx';
delete(outputfile); %// remove any existing file so that new results can be written into it
base_datasize_array = 40:60:400;
methods = {'BSXFUN on GPU','BSXFUN on CPU','ARRAYFUN','SPARSE'};
num_approaches = numel(methods);
num_sets = 3;
timeall_all = zeros(num_approaches,numel(base_datasize_array),num_sets);
datasize_lbs = cell(numel(base_datasize_array),num_sets);
for set_id = 1:num_sets
    switch set_id
        case 1
            N1_arr = base_datasize_array*2;
            N2_arr = N1_arr*10;
        case 2
            N2_arr = base_datasize_array*2;
            N1_arr = N2_arr*10;
        case 3
            N2_arr = base_datasize_array;
            N1_arr = N2_arr*40;
    end
    timeall = zeros(num_approaches,numel(N1_arr));
    for iter = 1:numel(N1_arr)
        M = rand(N1_arr(iter),N2_arr(iter));
        f = @() sumrows_stepped_bsxfun_gpu(M);
        timeall(1,iter) = gputimeit(f); clear f
        f = @() sumrows_stepped_bsxfun(M);
        timeall(2,iter) = timeit(f); clear f
        f = @() sumrows_stepped_arrayfun(M);
        timeall(3,iter) = timeit(f); clear f
        f = @() sumrows_stepped_sparse(M);
        timeall(4,iter) = timeit(f); clear f
    end
    timeall_all(:,:,set_id) = timeall;
    wp = repmat({' '},numel(N1_arr),1); %// cell of spaces; strcat keeps whitespace from cell inputs
    datasize_lbs(:,set_id) = strcat(cellstr(num2str(N1_arr.')),' x ',...
        wp,cellstr(num2str(N2_arr.')));
end
for set_id=1:num_sets
    out_cellarr = cell(numel(methods)+1,numel(N1_arr)+1);
    out_cellarr(1,1) = {'Methods'};
    out_cellarr(2:end,1) = methods;
    out_cellarr(1,2:end) = datasize_lbs(:,set_id);
    out_cellarr(2:end,2:end) = cellfun(@(x) num2str(x),...
        num2cell(timeall_all(:,:,set_id)),'Uni',0);
    xlswrite(outputfile, out_cellarr,set_id);
end
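If you would like to visualize the timings right after a run, here's a minimal plotting sketch that works off the in-memory timeall_all array (the figure styling is my own choice here, not how the result plots were generated) -
for set_id = 1:num_sets
    figure, plot(timeall_all(:,:,set_id).','-o') %// one line per approach
    legend(methods), xlabel('Datasize'), ylabel('Runtime (s)')
    set(gca,'XTick',1:numel(base_datasize_array),...
        'XTickLabel',datasize_lbs(:,set_id))
    title(['Set ' num2str(set_id)])
end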