2 votes

I have a bit of a problem with the speed of my MATLAB calculation. I was able to write code that runs the calculation for small matrices, but it uses nested for-loops, and with the large datasets I'm using, MATLAB fails to finish the computation.

Note: I'm not terribly familiar with MATLAB, so while the program works, it is extremely inefficient.

In short, I'm trying to create a matrix whose entries describe the relationship between a set of unique locations. As a concrete example, we start with this matrix:

B =

5873 4 1
5873 7 1
5873 1 1
2819 8 2
2819 1 2
9771 4 3
9771 2 3
9771 5 3
9771 6 3
5548 7 4

Here, the third column is the unique location identifier and the second column is the number of the "segment" that falls in that location. Note that a location can contain several segments, and the same segment can appear in several locations.

What I would like to do is create a matrix that describes the relationship between the different locations. Specifically, for locations i and j, I would like the (i,j) entry of the new matrix to be the number of segments that i and j have in common divided by the total number of distinct segments in i and j combined (the Jaccard index of their segment sets).

Currently my code looks like this:

C = zeros(max(B.data(:,3)), max(B.data(:,3)));
for i = 1:max(B.data(:,3))
    for j = 1:max(B.data(:,3))
        vi = B.data(:,3) == i;
        vj = B.data(:,3) == j;
        C(i,j) = numel(intersect(B.data(vi,2), B.data(vj,2))) / ...
                 numel(union(B.data(vi,2), B.data(vj,2)));
    end
end
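For reference, here is a direct translation of that nested-loop logic into NumPy (Python stands in for MATLAB here only so the snippet is self-contained and testable; a plain array replaces the B.data field), run on the example data above:

```python
import numpy as np

# Example data from the question: columns are [id, segment, location]
B = np.array([
    [5873, 4, 1], [5873, 7, 1], [5873, 1, 1],
    [2819, 8, 2], [2819, 1, 2],
    [9771, 4, 3], [9771, 2, 3], [9771, 5, 3], [9771, 6, 3],
    [5548, 7, 4],
])

seg, loc = B[:, 1], B[:, 2]
nloc = loc.max()

# Same O(nloc^2) double loop: Jaccard index of the two segment sets
C = np.zeros((nloc, nloc))
for i in range(1, nloc + 1):
    for j in range(1, nloc + 1):
        si = set(seg[loc == i])
        sj = set(seg[loc == j])
        C[i - 1, j - 1] = len(si & sj) / len(si | sj)
```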

But it is very slow. Does anyone have suggestions for speeding up the calculation?

Thanks so much!!

What's the typical number of locations and of segments? - Luis Mendo
Can you also provide a sample output of what you expect given your above example? The input is well defined, but what does the output look like? This is mainly because I want to make sure I understand your problem before trying to solve it. - rayryeng
Hi Everyone, thank you so much for replying! What I am looking for is a matrix C = [1 .25 .16 .33 ; .25 1 0 0 ; .16 0 1 0 ; .33 0 0 1], which is a symmetric matrix with 1's on the diagonal. - user240913
The datasets that I'm working with are very large. The largest one has 6,418,784 rows with 24,814 unique locations, meaning that the resultant matrix would be of size 24814x24814. Thank you so much for the different approaches provided; I'm not sure which one would be suitable for a dataset this large. - user240913

2 Answers

2 votes

Approach with grouping and loop

  1. Locations are grouped according to segment (cell array groups in the code below); accumarray with a custom function performs the grouping.
  2. A square matrix C, of size equal to the maximum location index, is initialized to zeros. For each group, every pair of locations belonging to that group has its entry in C increased by 1.
  3. The matrix C is normalized: each count C(i,j) is divided by d(i)+d(j)-C(i,j), where d = diag(C) holds the number of segments of each location, giving intersection over union.

Code:

groups = accumarray(B.data(:,2), B.data(:,3), [], @(x) {x});  %// step 1
C = zeros(max(B.data(:,3)));                                  %// step 2
for n = 1:numel(groups)
    ind = groups{n};
    C(ind,ind) = C(ind,ind)+1;
end
d = diag(C);                                                  %// step 3
C = C./(bsxfun(@plus,d,d.')-C);
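The three steps above can be mirrored in NumPy as a self-contained sketch (for illustration only; a Python dict stands in for the accumarray cell array, and the data is the example from the question):

```python
import numpy as np
from collections import defaultdict

# Example data from the question: columns are [id, segment, location]
B = np.array([
    [5873, 4, 1], [5873, 7, 1], [5873, 1, 1],
    [2819, 8, 2], [2819, 1, 2],
    [9771, 4, 3], [9771, 2, 3], [9771, 5, 3], [9771, 6, 3],
    [5548, 7, 4],
])
seg, loc = B[:, 1], B[:, 2]

# Step 1: group (0-based) location indices by segment
groups = defaultdict(list)
for s, l in zip(seg, loc):
    groups[s].append(l - 1)

# Step 2: co-occurrence counts -- each group bumps all its location pairs
nloc = loc.max()
C = np.zeros((nloc, nloc))
for ind in groups.values():
    ix = np.array(ind)
    C[np.ix_(ix, ix)] += 1

# Step 3: normalise |i n j| / (|i| + |j| - |i n j|); diag(C) holds |i|
d = np.diag(C).copy()
C = C / (d[:, None] + d[None, :] - C)
```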

Vectorized approach using a 3D array

This approach is memory-hungry, and according to @Divakar's benchmarks below, for very large datasets it is not faster than the loop approach:

  1. It builds a 3D logical array T such that T(m,n,s) is 1 if and only if locations m and n share segment s. This is done efficiently with bsxfun.
  2. Summing T along the third dimension yields the same C as in the previous approach.
  3. The matrix C is normalized in the same way as previously.

Code:

T = full(sparse(B.data(:,3), B.data(:,2), 1));               %// step 1
T = bsxfun(@and, permute(T, [1 3 2]), permute(T, [3 1 2]));
C = sum(T, 3);                                               %// step 2
d = diag(C);                                                 %// step 3
C = C./(bsxfun(@plus,d,d.')-C);
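The same 3D broadcasting idea can be sketched in NumPy (again for illustration only, with boolean broadcasting playing the role of bsxfun):

```python
import numpy as np

# Example data from the question: columns are [id, segment, location]
B = np.array([
    [5873, 4, 1], [5873, 7, 1], [5873, 1, 1],
    [2819, 8, 2], [2819, 1, 2],
    [9771, 4, 3], [9771, 2, 3], [9771, 5, 3], [9771, 6, 3],
    [5548, 7, 4],
])
seg, loc = B[:, 1], B[:, 2]

# Step 1: location-by-segment incidence matrix (like full(sparse(...)))
T = np.zeros((loc.max(), seg.max()), dtype=bool)
T[loc - 1, seg - 1] = True

# 3D array: T3[m, n, s] is True iff locations m and n share segment s
T3 = T[:, None, :] & T[None, :, :]

# Step 2: sum over segments gives the co-occurrence counts
C = T3.sum(axis=2).astype(float)

# Step 3: same Jaccard normalisation as before
d = np.diag(C).copy()
C = C / (d[:, None] + d[None, :] - C)
```

Note the memory cost: T3 has (number of locations)^2 x (number of segments) entries, which is exactly why this approach struggles with many groups.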

Benchmarking

Credit goes to @Divakar for the following benchmarking and explanations:

Here are some runtimes comparing the loop-based approach against the latter vectorized one -

***** Parameters: No. of rows in B ~= 10000 and No. of groups = 10 *****
---------------------------------- With loopy approach
Elapsed time is 0.242635 seconds.
---------------------------------- With vectorized approach
Elapsed time is 0.022174 seconds.
 
 
***** Parameters: No. of rows in B ~= 10000 and No. of groups = 100 *****
---------------------------------- With loopy approach
Elapsed time is 0.318268 seconds.
---------------------------------- With vectorized approach
Elapsed time is 0.451242 seconds.
 
 
***** Parameters: No. of rows in B ~= 100000 and No. of groups = 100 *****
---------------------------------- With loopy approach
Elapsed time is 1.173182 seconds.
---------------------------------- With vectorized approach
Elapsed time is 0.464339 seconds.
 
 
***** Parameters: No. of rows in B ~= 100000 and No. of groups = 1000 *****
---------------------------------- With loopy approach
Elapsed time is 10.310780 seconds.
---------------------------------- With vectorized approach
Elapsed time is 54.216923 seconds.

One can notice that for the same number of rows in B, an increase in the number of groups causes a considerable drop in performance for the vectorized approach.

So keep this trade-off in mind when choosing the appropriate approach for a given problem.

1 vote

This could be an approach using a single loop -

%// Note: this assumes the rows of B are already grouped (sorted) by column 3
%// Store columns 2 and 3 in separate variables for easy access
col2 = B(:,2);
col3 = B(:,3);

%// Start and end indices of the groups w.r.t column 3 of B
ends = [find(diff(col3)) ; numel(col3)];
starts = [1 ; ends(1:end-1)+1];

ngrp = max(col3); %// number of groups
intersectM = zeros(ngrp); %// initialize 2D matrix to store intersect counts
for k = 1:ngrp-1

    %// Matches for each group with respect to each other group 
    %// but one less after each iteration
    matches = ismember(col2(starts(k+1):end),col2(starts(k):ends(k)));
    %// OR matches =
    %// any(bsxfun(@eq,col2(starts(k):ends(k)),col2(starts(k+1):end).'),1)'

    %// Store intersect counts
    intersectM(k,k+1:end) = accumarray(col3(starts(k+1):end)-k,matches,[]);
end

grp_counts = histc(col3,1:ngrp); %// counts of group elements

%// Perform numel(intersect)/numel(union)
C = intersectM./(bsxfun(@plus,grp_counts,grp_counts') - intersectM);

%// Since the output is symmetric, copy the upper-triangular
%// elements into the lower-triangular places
C = triu(C)' + C;
C(1:ngrp+1:end) = 1; %// Set diagonal elements to ones
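This single-loop approach can also be mirrored in NumPy as a sketch (for illustration; np.isin and np.bincount stand in for ismember and accumarray, and the rows must be grouped by the location column, as in the MATLAB code above):

```python
import numpy as np

# Example data from the question, rows grouped by the location column
B = np.array([
    [5873, 4, 1], [5873, 7, 1], [5873, 1, 1],
    [2819, 8, 2], [2819, 1, 2],
    [9771, 4, 3], [9771, 2, 3], [9771, 5, 3], [9771, 6, 3],
    [5548, 7, 4],
])
col2, col3 = B[:, 1], B[:, 2]

# Group start/end indices w.r.t. column 3 (mirrors the diff/find step)
ends = np.append(np.flatnonzero(np.diff(col3)), col3.size - 1)
starts = np.append(0, ends[:-1] + 1)

ngrp = col3.max()
inter = np.zeros((ngrp, ngrp))
for k in range(ngrp - 1):
    # rows after group k whose segment also appears in group k
    matches = np.isin(col2[starts[k + 1]:], col2[starts[k]:ends[k] + 1])
    # count the matches per later group (mirrors accumarray)
    inter[k, k + 1:] = np.bincount(col3[starts[k + 1]:] - (k + 2),
                                   weights=matches,
                                   minlength=ngrp - k - 1)

counts = np.bincount(col3, minlength=ngrp + 1)[1:]  # histc(col3, 1:ngrp)
C = inter / (counts[:, None] + counts[None, :] - inter)
C = np.triu(C).T + C          # mirror the upper triangle
np.fill_diagonal(C, 1.0)
```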