OK, here is the formula in MATLAB:

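function D = dumDistance(X,Y)
% Squared Euclidean distance between every column of X and every column of Y.
n1 = size(X,2);   % number of column vectors in X
n2 = size(Y,2);   % number of column vectors in Y
D = zeros(n1,n2); % D(i,j) holds the distance between X(:,i) and Y(:,j)
for i = 1:n1
    for j = 1:n2
        % Subtract component-wise, square, and sum over all dimensions
        D(i,j) = sum((X(:,i)-Y(:,j)).^2);
    end
end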

Credits here (I know it's not a fast implementation, but it is enough to show the basic algorithm).

Here is what I don't understand:

Say that we have a matrix dictionary of size 140x100 and a matrix page of size 140x40. Each column represents a word in the 140-dimensional space.

Now, if I use dumDistance(page,dictionary), it will return a 40x100 matrix with the distances.

What I want to achieve is to find how close each word of the page matrix is to the words of the dictionary matrix, so that I can represent the page in terms of the dictionary, with a histogram, say.

I know that if I take the min of the 40x100 matrix, I'll get a 1x100 matrix, with the locations of the min values to represent my histogram.

What I really can't understand here is this 40x100 matrix. What data does this matrix represent? I can't visualize it in my mind.

1 Answer

Minor comment before I start:

You should really use pdist2 instead. It is much faster and gives the same distances as dumDistance, apart from a squaring step covered in the note at the end. You would call it like this:

D = pdist2(page.', dictionary.');

You need to transpose page and dictionary because pdist2 assumes that each row is an observation and each column is a variable / feature, whereas your data stores one observation per column. This returns a 40 x 100 matrix, just like dumDistance, but pdist2 does not use for loops.
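pdist2 ships with the Statistics and Machine Learning Toolbox. If that toolbox is not available, the loops can still be avoided. The sketch below is one possible way, not something from the original post: it expands ||x - y||^2 = ||x||^2 + ||y||^2 - 2*x'*y and uses made-up stand-in data with the sizes from the question.

```matlab
% Deterministic stand-in data (hypothetical values, sizes from the question).
% Each column is one 140-dimensional word.
page       = mod(reshape(1:140*40,  140, 40)  * 7, 11);
dictionary = mod(reshape(1:140*100, 140, 100) * 3, 13);

% Squared norms of every word.
pp = sum(page.^2, 1).';        % 40 x 1
dd = sum(dictionary.^2, 1);    % 1 x 100

% Squared Euclidean distances: ||x||^2 + ||y||^2 - 2*x'*y for every pair.
D = bsxfun(@plus, pp, dd) - 2 * (page.' * dictionary);   % 40 x 100
D = max(D, 0);                 % guard against tiny negative round-off

% Spot check one entry against the definition used in dumDistance.
i = 3; j = 17;
ref = sum((page(:,i) - dictionary(:,j)).^2);
fprintf('D(%d,%d) = %g, reference = %g\n', i, j, D(i,j), ref);
```

With real-valued features the two numbers may differ by floating-point round-off, so compare them with a tolerance rather than with ==.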


Now onto your question:

D(i,j) is the squared Euclidean distance between word i from your page and word j from your dictionary. You have 40 words on your page and 100 words in your dictionary, and each word is represented by a 140-dimensional feature vector. The rows of D therefore index the words of page, while the columns of D index the words of dictionary.
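A toy example may make the layout easier to picture. The data below is invented purely for illustration: 2-dimensional "words" instead of 140-dimensional ones, with 2 page words and 3 dictionary words, run through the same double loop as dumDistance.

```matlab
% Toy data (hypothetical): one word per column.
page       = [0 3;
              0 4];      % page words: (0,0) and (3,4)
dictionary = [0 3 1;
              0 4 1];    % dictionary words: (0,0), (3,4) and (1,1)

n1 = size(page, 2);
n2 = size(dictionary, 2);
D  = zeros(n1, n2);
for i = 1:n1
    for j = 1:n2
        D(i,j) = sum((page(:,i) - dictionary(:,j)).^2);
    end
end
disp(D)
% Row 1 (page word (0,0)):   0  25   2
% Row 2 (page word (3,4)):  25   0  13
```

Reading row 2: page word (3,4) is identical to dictionary word 2 (distance 0), and its squared distance to dictionary word 3 is (3-1)^2 + (4-1)^2 = 13.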

By "distance" I mean distance in the feature space. Every word from your page and from your dictionary is represented as a vector of length 140. For each entry (i,j) of D, the ith vector from page and the jth vector from dictionary are subtracted component by component, the differences are squared, and the squares are summed. The result is stored in D(i,j), which gives you the dissimilarity between word i from your page and word j from your dictionary. The higher the value, the more dissimilar the two words are.
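Since small values mean similar words, the closest dictionary word for each page word is the column with the smallest entry in that page word's row. The sketch below shows one possible way to turn that into the histogram mentioned in the question. The distance matrix is made up, and the variable names nearest and counts are only illustrative. Note that min(D, [], 2) searches along each row, whereas plain min(D) searches down each column and returns one value per dictionary word.

```matlab
% Hypothetical distances: 3 page words (rows) x 3 dictionary words (columns).
D = [ 0 25  2;
     25  0 13;
      1 30  5];

% For every page word, find the closest dictionary word (minimum of each row).
[minDist, nearest] = min(D, [], 2);   % nearest = [1; 2; 1]

% Count how many page words were assigned to each dictionary word.
counts = accumarray(nearest, 1, [size(D, 2) 1]);   % counts = [2; 1; 0]

disp(nearest.');
disp(counts.');
```

For the 40 x 100 matrix from the question, nearest would be 40 x 1 and counts would be 100 x 1, one bin per dictionary word.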

Minor note: pdist2 computes the Euclidean distance, while dumDistance computes the squared Euclidean distance. To get the same values as dumDistance, square every element of the D returned by pdist2. In other words, simply compute D.^2.

Hope this helps. Good luck!