2
votes

Assume that I have 2 strings of characters:

AACCCGGAAATTTGGAATTTTCCCCAAATACG

CGATGATCGATGAATTTTAGCGGATACGATTC

I want to find by how much I should move the second string such that it matches the first one the most.

There are 2 cases. The first one is that we assume that the string are wrapped around, and the second one is that we don't.

Is there a matlab function that does returns either a N array or 2N+1 array of values for how much the shifted string 2 correlates with string 1?

If not, is there a faster/simpler method than something like

result = zeroes(length, 1)
for i = 0:length-1
    result(i+1) = sum (str1 == circshift(str2, i));
end
2
You might like to take a look at Bioinformatics Toolbox, which contains implementations of Smith-Waterman and Needleman-Wunsch alignment algorithms. - Sam Roberts

2 Answers

5
votes

You can convert each char into a binary column of size 4:

A -> [1;0;0;0]
C -> [0;1;0;0]
G -> [0;0;1;0]
T -> [0;0;0;1]

As a result a string of length n becomes a binary matrix of size 4-by-n.

You can now cross-correlate (along X axis only) the two n-by-4 and m-by-4 to get your result.

4
votes

With a hat tip to John d'Errico:

str1 = 'CGATGATCGATGAATTTTAGCGGATACGATTC';
str2 = 'AACCCGGAAATTTGGAATTTTCCCCAAATACG';

% the circulant matrix
n = length(str2);
C = str2( mod(bsxfun(@plus,(0:n-1)',0:n-1),n)+1 ); %//'

% Find the maximum number of matching characters, and the amount 
% by which to shift the string to achieve this result
[score, shift] = max( sum(bsxfun(@eq, str1, C), 2) );

Faster yes, simpler...well, I'll leave that up to you to decide :)

Note that this method trades memory for speed. That is, it creates the matrix of all possible shifts in memory (efficiently), and compares the string to all rows of this matrix. That matrix will contain elements, so if N becomes large, it's better to use the loop (or Shai's method).