2
votes

I have two matrices. One is of size 1,000,000 x 9 and the other is 500,000 x 9.

The columns have the same meaning and the first 7 columns have the function of a key. Correspondingly, the last two columns have data character. There are many overlapping key values in both of the matrices and I would like to have a big matrix to compare the values. This big matrix should be of dimension 1,000,000 x 11.

For example:

A = [0 0 0 0 0 0 0 10 20; 0 0 0 0 0 0 1 30 40];
B = [0 0 0 0 0 0 0 50 60];

A merged matrix would look like this:

C = [0 0 0 0 0 0 0 10 20 50 60; 0 0 0 0 0 0 1 30 40 0 0];

As you can see, the first row of C has columns 8, 9 from matrix A and columns 10,11 from matrix B. The second row uses the columns 8, 9 from matrix A and 0,0 for the last to columns because there is no corresponding entry in matrix B.

I have accomplished this task theoretically, but it is very, very slow. I use loops a lot. In any other programming language, I would sort both tables, would iterate both of the tables in one big loop keeping two pointers.

Is there a more efficient algorithm available in Matlab using vectorization or at least a sufficiently efficient one that is idiomatic/short?

(Additional note: My largest issue seems to be the search function: Given my matrix, I would like to throw in one column vector 7x1, let's name it key to find the corresponding row. Right now, I use bsxfun for that:

targetRow = data( min(bsxfun(@eq, data(:, 1:7), key), [], 2) == 1, :);

I use min because the result of bsxfun is a vector with 7 match flags and I obviously want all of them to be true. It seems to me that this could be bottleneck of a Matlab algorithm)

2
Semantically, I would always prefer all(X,2) over min(X,[],2)==1 for a logical array. Not sure if it is any faster though.Florian
"This could be a bottleneck" - you can actually check whether it is! Try using the profiler to display which lines / sections of your code are using the most time. uk.mathworks.com/help/matlab/ref/profile.html Namely, profile on; <yourCodeHere>; profile viewer;Wolfie

2 Answers

2
votes

Maybe with ismember and some indexing:

% locates in B the last ocurrence of each key in A. idxA has logicals of
% those keys found, and idxB tells us where in B.
[idxA, idxB] = ismember(A(:,1:7), B(:,1:7),'rows'); 
C = [ A zeros(size(A, 1), 2) ];
C(idxA, 10:11) = B(idxB(idxA), 8:9); % idxB(idxA) are the idxB != 0
1
votes

I think this does what you want, only tested with your simple example.

% Initial matrices
A = [0 0 0 0 0 0 0 10 20;
     0 0 0 0 0 0 1 30 40];
B = [0 0 0 0 0 0 0 50 60];

% Stack matrices with common key columns, 8&9 or 10&11 for data columns
C = [[A, zeros(size(A,1),2)]; [B(:,1:7), zeros(size(B,1),2), B(:,8:9)]];
% Sort C so that matching key rows will be consecutive
C = sortrows(C,1:7);

% Loop through rows
curRow = 1;
lastRow = size(C,1) - 1;
while curRow < lastRow
    if all(C(curRow,1:7) == C(curRow+1,1:7))
        % If first 7 cols of 2 rows match, take max values (override 0s)
        % It may be safer to initialise the 0 columns to NaNs, as max will
        % choose a numeric value over NaN, and it allows your data to be
        % negative values.
        C(curRow,8:11) = max(C(curRow:curRow+1, 8:11));
        % Remove merged row
        C(curRow+1,:) = [];
        % Decrease size counter for matrix
        lastRow = lastRow - 1;
    else
        % Increase row counter
        curRow = curRow + 1;        
    end
end

Answer:

C = [0     0     0     0     0     0     0    10    20    50    60
     0     0     0     0     0     0     1    30    40     0     0]