14 votes

I have an NxM matrix (usually 10k X 10k elements) describing a ground set. Each row represents an object and each column a specific feature. For example, in the matrix

   f1 f2 f3
x1 0  4  -1
x2 1  0  5
x3 4  0  0
x4 0  1  0

the object x1 has value 0 in feature f1, value 4 in feature f2, and value -1 in feature f3. The values are general real numbers (doubles).

I have to compute several custom distances/dissimilarities between all pairs of objects (all pairs of rows). For comparison, I want to compute the L1 (Manhattan) and L2 (Euclidean) distances.

I have used the Eigen library to perform most of my computations. To compute the L2 (Euclidean) distance, I use the following observation: for two vectors a and b of size n, we have:

||a - b||^2 = (a_1 - b_1)^2 + (a_2 - b_2)^2 + ... + (a_n - b_n)^2
            = a_1^2 + b_1^2 - 2 a_1 b_1 + a_2^2 + b_2^2 - 2 a_2 b_2 + ... + a_n^2 + b_n^2 - 2 a_n b_n
            = a . a + b . b - 2 a . b

In other words, we rewrite the squared norm using the dot products of the vectors with themselves and subtract twice the dot product between them. From that, we just take the square root and we are done. By the way, I found this trick a long time ago and unfortunately I lost the reference to its author.

Anyway, this enables writing fancy code using Eigen (in C++):

Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic> matrix, XX, D;

// Load matrix here, for example
// matrix.resize(4, 3);
// matrix << 0, 4, -1,
//           1, 0,  5,
//           4, 0,  0,
//           0, 1,  0;

const auto N = matrix.rows();

XX.resize(N, 1);
D.resize(N, N);

// XX(i) = x_i . x_i  (squared norm of each row)
XX = matrix.array().square().rowwise().sum();

// D(i, j) = ||x_i||^2 + ||x_j||^2
D.noalias() = XX * Eigen::MatrixXd::Ones(1, N) +
              Eigen::MatrixXd::Ones(N, 1) * XX.transpose();

// D(i, j) -= 2 x_i . x_j, then take the element-wise square root
D -= 2 * matrix * matrix.transpose();
D = D.cwiseSqrt();
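
One numerical caveat with this formulation (my own note, not part of the original code): cancellation in a.a + b.b - 2 a.b can leave tiny negative entries, for instance on the diagonal or for near-duplicate rows, and cwiseSqrt() turns those into NaN. A possible guard is to clamp at zero before taking the square root:

// Clamp small negative values caused by floating-point cancellation
// before taking the element-wise square root.
D = D.cwiseMax(0.0).cwiseSqrt();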

For a 10k X 10k matrix, we are able to compute the L2 distance for all pairs of objects/rows in less than 1 min (2 cores / 4 threads), which I personally consider good performance for my purposes. Eigen is able to combine the operations and to use several low/high-level optimizations to perform these computations. In this case, Eigen uses all available cores (and, of course, we can tune that).
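
As a follow-up on "we can tune that", here is a minimal sketch of limiting Eigen's thread count, assuming Eigen was compiled with OpenMP support:

#include <Eigen/Core>

// Restrict Eigen's internal parallelism, e.g. to the number of physical cores.
Eigen::setNbThreads(2);
// ... run the distance computation ...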

However, I still need to compute the L1 distance, but I couldn't figure out a good algebraic form to use with Eigen and obtain nice performance. So far I have the following:

// distance must be an N x N matrix allocated beforehand; only the
// upper triangle (j > i) is filled here.
const auto N = matrix.rows();
for(long i = 0; i < N - 1; ++i) {
    const auto &row = matrix.row(i);

    #ifdef _OPENMP
    #pragma omp parallel for shared(row)
    #endif
    for(long j = i + 1; j < N; ++j) {
        distance(i, j) = (row - matrix.row(j)).lpNorm<1>();
    }
}

But this takes much longer: for the same 10k X 10k matrix, this code needs 3.5 min, which is much worse considering that L1 and L2 are very similar in their original forms:

L1(a, b) = sum_i |a_i - b_i|
L2(a, b) = sqrt(sum_i |a_i - b_i|^2)
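
For a single pair of rows, both definitions map directly onto Eigen calls; a minimal sketch using x1 and x2 from the example matrix above:

#include <Eigen/Dense>
#include <iostream>

int main() {
    Eigen::Vector3d a(0, 4, -1), b(1, 0, 5);   // rows x1 and x2
    double l1 = (a - b).lpNorm<1>();           // sum_i |a_i - b_i|        -> 11
    double l2 = (a - b).norm();                // sqrt(sum_i (a_i - b_i)^2) -> sqrt(53)
    std::cout << l1 << " " << l2 << "\n";
}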

Any idea how to change the L1 computation to use fast operations with Eigen? Or a better way to do this that I just didn't figure out?

Thank you very much for your help!

Carlos

This does not answer your question, but note that if you have only 2 physical cores, then you should enable only 2 threads, as hyperthreading slows down CPU-intensive operations. You can also initialize D using replicate: D = XX.replicate(1,n) + XX.transpose().replicate(n,1); – ggael
I'm going to go out on a limb here... Notice that you're manipulating rows. However, by default, Eigen matrices are in column-major order (eigen.tuxfamily.org/dox-devel/group__QuickRefPage.html). That means that whenever you're calling row(), Eigen has to read from a lot of non-contiguous areas of memory. Could it be that you will get better performance / a lower number of cache misses if you switch to row-major order (see the sketch after these comments)? Note that the matrix multiplications used by the L2 norm aren't as affected by this, since the underlying operations are optimized for both orders via the 'T' parameters in dgemm. – Patrick Mineault
@PatrickMineault Yes, you are right. Indeed, I have changed the matrix order to speed things up a little. It makes a good difference, but not the one I am looking for. Anyway, thanks for the notice. – an_drade
If it were possible to use 8-bit values, you could use the instruction or intrinsic _mm_mpsadbw_epu8. In this way, you can do eight 8-byte sums of absolute differences in 9 clock cycles. software.intel.com/sites/default/files/m/a/9/b/7/b/1000-SSE.pdf – Jens Munk
I can't help you mathematically, but I can assure you that there are faster ways to do it programmatically. For a 10k by 10k matrix it may be worth considering the GPU. Also, my own experience shows that using SIMD vector instructions is much faster than just using OpenMP for parallelization. So, regardless of using OpenMP or not, you should write your code to use SIMD. – Sebastian Cabot
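
Regarding the storage-order comment above, a minimal sketch of what switching to row-major order could look like (the alias name is mine, not from the original code):

#include <Eigen/Dense>

// Same data as before, but stored row by row, so matrix.row(i) is a
// contiguous block of memory (the default MatrixXd is column-major).
using RowMajorMatrixXd =
    Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;

RowMajorMatrixXd matrix;  // drop-in replacement for the column-major declaration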

2 Answers

2 votes

Let's just calculate both distances at the same time. They only really share the row difference (while both could use the absolute difference, the Euclidean distance uses the square, which isn't really equivalent). So now we only make the n^2 pass over the rows once.

const auto N = matrix.rows();
for(long i = 0; i < N - 1; ++i) {
    const auto &row = matrix.row(i);

    #ifdef _OPENMP
    #pragma omp parallel for shared(row)
    #endif
    for(long j = i + 1; j < N; ++j) {
        const auto &rowDiff = row - matrix.row(j);
        distanceL1(i, j) = rowDiff.cwiseAbs().sum(); // or .lpNorm<1>(); if it's faster
        distanceL2(i, j) = rowDiff.norm();
    }
}
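
A side note on this code (my addition, not part of the original answer): with auto, rowDiff stores an unevaluated Eigen expression, so the row difference is recomputed once for each norm. Evaluating it into a concrete vector first would compute it only once; a minimal sketch:

// Force evaluation of the difference into a temporary vector so it is
// computed once and reused by both norms.
const Eigen::RowVectorXd rowDiff = row - matrix.row(j);
distanceL1(i, j) = rowDiff.cwiseAbs().sum();
distanceL2(i, j) = rowDiff.norm();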

EDIT: another, more memory-intensive / untested method could be to calculate a whole distance row each iteration (I don't know whether this would be an improvement or not):

const auto N = matrix.rows();
#ifdef _OPENMP
#pragma omp parallel for shared(matrix)
#endif
for(long i = 0; i < N - 1; ++i) {
    const auto &row = matrix.row(i);
    // you could use matrix.block(i,j,k,l) to cut down on the number of unnecessary operations
    const auto &mat = matrix.rowwise() - row;

    distanceL1.row(i) = mat.cwiseAbs().rowwise().sum().transpose();
    distanceL2.row(i) = mat.rowwise().norm().transpose();
}

0 votes

These are two very common operations in image processing. The first is the Sum of Squared Differences (SSD); the second is the Sum of Absolute Differences (SAD).

You've correctly determined that SSD just requires one to compute the cross-correlation between the two series as the main computation. However, you may want to consider using the FFT to compute those a.b terms; it will cut down the number of operations needed significantly for the L2 case (by how much I do not know; it depends on what matrix-matrix multiplication algorithm Eigen uses). If you need me to explain this I can, but I figure you can also look it up, as it's a standard use of FFTs. OpenCV has a (pretty bad/buggy) implementation of template matching, which is what you want when using CV_TM_SQDIFF.

The L1 case is trickier. It can't be decomposed as nicely, but it is also one of the simplest operations you can do (just additions and absolute values). As such, a lot of computing architectures have parallelized implementations of it as instructions or hardware-implemented functions. Other architectures have researchers experimenting with the best way to compute this.

You may also want to look into the Intel Imaging Primitives for cross-correlation, and fast FFT libraries such as FFTW & CUFFT. If you can't afford the Intel Imaging Primitives, you can use SSE instructions to greatly speed up your processing to almost the same effect.
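
Along the lines of the SSE suggestion, here is a minimal, hedged sketch of an L1 distance between two double arrays using SSE2 intrinsics (the helper name and structure are mine, not from any library):

#include <emmintrin.h>   // SSE2 intrinsics
#include <cmath>
#include <cstddef>

// Hypothetical helper: L1 distance between two double arrays of length n.
double l1_distance_sse2(const double* a, const double* b, std::size_t n) {
    const __m128d sign_mask = _mm_set1_pd(-0.0);   // only the sign bit set
    __m128d acc = _mm_setzero_pd();
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d diff = _mm_sub_pd(_mm_loadu_pd(a + i), _mm_loadu_pd(b + i));
        acc = _mm_add_pd(acc, _mm_andnot_pd(sign_mask, diff));  // accumulate |diff|
    }
    double lanes[2];
    _mm_storeu_pd(lanes, acc);
    double sum = lanes[0] + lanes[1];
    for (; i < n; ++i) sum += std::abs(a[i] - b[i]);            // scalar tail
    return sum;
}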