
In the Hough Line Transform, for each edge pixel we find the corresponding rho and theta in the Hough parameter space. The accumulator for rho and theta has to be global. If we want to parallelize the algorithm, what is the best way to split the accumulator space?

I don't see why you would split the accumulator space. Instead, split the original image. This way you parallelize both edge detection and the transform. — svinja

1 Answer


The best way to parallelise an algorithm may depend on several aspects. An important such aspect is the hardware that you are targeting. As you have tagged your question with "openmp", I assume that, in your case, the target is an SMP system.

To answer your question, let us start by looking at a typical, straightforward implementation of the Hough transform (I will use C, but what follows applies to C++ and Fortran as well):

#include <math.h>     /* cos, sin, sqrt, M_PI_2 */
#include <stdbool.h>  /* bool */
#include <stdlib.h>   /* calloc */

size_t *hough(bool *pixels, size_t w, size_t h, size_t res, size_t *rlimit)
{
  /* Upper bound for the distance dimension of the Hough space. */
  *rlimit = (size_t)(sqrt(w * w + h * h));

  double step = M_PI_2 / res;
  size_t *accum = calloc(res * *rlimit, sizeof(size_t));
  size_t x, y, t;

  for (x = 0; x < w; ++x)           /* iterate over the input pixels ...   */
    for (y = 0; y < h; ++y)
      if (pixels[y * w + x])        /* ... and let every set pixel vote    */
        for (t = 0; t < res; ++t)   /* along the angular dimension         */
        {
          double theta = t * step;
          size_t r = x * cos(theta) + y * sin(theta);
          ++accum[r * res + t];
        }

  return accum;
}

Given an array of black-and-white pixels (stored row-wise), a width, a height and a target resolution for the angle-component of the Hough space, the function hough returns an accumulator array for the Hough space (organised "angle-wise") and stores an upper bound for its distance dimension in the output argument rlimit. That is, the number of elements in the returned accumulator array is given by res * (*rlimit).

The body of the function centres on three nested loops: the two outermost loops iterate over the input pixels, while the conditionally executed innermost loop iterates over the angular dimension of the Hough space.
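
To make the interface concrete, here is a minimal, hypothetical driver (not part of the original code): it draws a single vertical line into a synthetic image, runs the transform, and looks up the strongest accumulator cell. The image size, the line position and the resolution of 90 are arbitrary choices for illustration, and the snippet assumes it is compiled together with the hough function (and its includes) above.

#include <stdio.h>

int main(void)
{
  size_t w = 64, h = 64, res = 90, rlimit, i, best = 0;
  bool *pixels = calloc(w * h, sizeof(bool));

  for (i = 0; i < h; ++i)            /* a vertical line at x = 20 */
    pixels[i * w + 20] = true;

  size_t *accum = hough(pixels, w, h, res, &rlimit);

  for (i = 1; i < res * rlimit; ++i) /* find the cell with the most votes */
    if (accum[i] > accum[best])
      best = i;

  /* With the "angle-wise" layout, index = r * res + t; for the vertical
     line we expect r = 20, t = 0 (theta = 0) and h votes. */
  printf("peak: r = %zu, theta index = %zu, votes = %zu\n",
         best / res, best % res, accum[best]);

  free(pixels);
  free(accum);
  return 0;
}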

To parallelise the algorithm, we have to somehow decompose it into pieces that can execute concurrently. Typically, such a decomposition is induced either by the structure of the computation or by the structure of the data that are operated on.

As, besides iteration, the only computationally interesting task that is carried out by the function is the trigonometry in the body of the innermost loop, there are no obvious opportunities for a decomposition based on the structure of the computation. Therefore, let us focus on decompositions based on the structure of the data, and let us distinguish between

  1. data decompositions that are based on the structure of the input data, and
  2. data decompositions that are based on the structure of the output data.

The structure of the input data, in our case, is given by the pixel array that is passed as an argument to the function hough and that is iterated over by the two outermost loops in the body of the function.

The structure of the output data is given by the structure of the returned accumulator array and is iterated over by the innermost loop in the body of the function.

We first look at output-data decomposition as, for the Hough transform, it leads to the simplest parallel algorithm.

Output-data decomposition

Decomposing the output data into units that can be produced relatively independently amounts to having the iterations of the innermost loop execute in parallel.

Doing so, one has to take into account any so-called loop-carried dependencies of the loop that is to be parallelised. In this case, this is straightforward as there are no such loop-carried dependencies: all iterations of the loop require read-write access to the shared array accum, but each iteration operates on its own "private" segment of that array (i.e., those elements whose indices i satisfy i % res == t).

Using OpenMP, this gives us the following straightforward parallel implementation:

size_t *hough(bool *pixels, size_t w, size_t h, size_t res, size_t *rlimit)
{
  *rlimit = (size_t)(sqrt(w * w + h * h));

  double step = M_PI_2 / res;
  size_t *accum = calloc(res * *rlimit, sizeof(size_t));
  size_t x, y, t;

  for (x = 0; x < w; ++x)
    for (y = 0; y < h; ++y)
      if (pixels[y * w + x])
        /* Distribute the votes of this pixel, i.e. the iterations over the
           angular dimension, across a (newly formed) team of threads. */
        #pragma omp parallel for
        for (t = 0; t < res; ++t)
        {
          double theta = t * step;
          size_t r = x * cos(theta) + y * sin(theta);
          ++accum[r * res + t];
        }

  return accum;
}

Input-data decomposition

A data decomposition that follows the structure of the input data can be obtained by parallelising the outermost loop.

That loop, however, does have a loop-carried flow dependency as each loop iteration potentially requires read-write access to each cell of the shared accumulator array. Hence, in order to obtain a correct parallel implementation we have to synchronise these accumulator accesses. In this case, this can easily be done by updating the accumulators atomically.

The loop also carries two so-called antidependencies. These are induced by the induction variables y and t of the inner loops and are trivially dealt with by making them private variables of the parallel outer loop.

The parallel implementation that we end up with then looks like this:

size_t *hough(bool *pixels, size_t w, size_t h, size_t res, size_t *rlimit)
{
  *rlimit = (size_t)(sqrt(w * w + h * h));

  double step = M_PI_2 / res;
  size_t *accum = calloc(res * *rlimit, sizeof(size_t));
  size_t x, y, t;

  /* One team for the whole nest; x is the parallel loop variable, while the
     inner induction variables y and t must be privatised explicitly. */
  #pragma omp parallel for private(y, t)
  for (x = 0; x < w; ++x)
    for (y = 0; y < h; ++y)
      if (pixels[y * w + x])
        for (t = 0; t < res; ++t)
        {
          double theta = t * step;
          size_t r = x * cos(theta) + y * sin(theta);
          /* Serialise concurrent updates of the shared accumulator. */
          #pragma omp atomic
          ++accum[r * res + t];
        }

  return accum;
}

Evaluation

Evaluating the two data-decomposition strategies, we observe that:

  • For both strategies, we end up with a parallelisation in which the computationally heavy parts of the algorithm (the trigonometry) are nicely distributed over threads.
  • Decomposing the output data gives us a parallelisation of the innermost loop in the function hough. As this loop does not have any loop-carried dependencies, we do not incur any data-synchronisation overhead. However, as the innermost loop is executed for every set input pixel, we incur quite a lot of overhead from repeatedly entering a parallel region (forming a team of threads, distributing the work and synchronising at the implicit barrier).
  • Decomposing the input data gives a parallelisation of the outermost loop. This loop is only executed once and so the threading overhead is minimal. On the other hand, however, we do incur some data-synchronisation overhead for dealing with a loop-carried flow dependency.

Atomic operations in OpenMP can typically be assumed to be quite efficient, while threading overheads are known to be rather large. Consequently, one expects that, for the Hough transform, input-data decomposition gives a more efficient parallel algorithm. This is confirmed by a simple experiment. For this experiment, I applied the two parallel implementations to a randomly generated 1024x768 black-and-white picture with a target resolution of 90 (i.e., 1 accumulator per degree of arc) and compared the results with the sequential implementation. This table shows the relative speedups obtained by the two parallel implementations for different numbers of threads:

# threads | OUTPUT DECOMPOSITION | INPUT DECOMPOSITION
----------+----------------------+--------------------
     2    |          1.2         |         1.9         
     4    |          1.4         |         3.7
     8    |          1.5         |         6.8

(The experiment was carried out on an otherwise unloaded dual 2.2 GHz quad-core Intel Xeon E5520. All speedups are averages over five runs. The average running time of the sequential implementation was 2.66 s.)
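
Should you want to reproduce measurements of this kind, nothing beyond the OpenMP runtime API is needed. The following sketch is not part of the original experiment: it fills a random 1024x768 image, takes the thread count from the command line, and times a single run of whichever hough variant it is compiled together with (averaging over several runs, as in the table above, is left out for brevity).

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

int main(int argc, char **argv)
{
  size_t w = 1024, h = 768, res = 90, rlimit, i;
  bool *pixels = malloc(w * h * sizeof(bool));

  srand(42);                               /* arbitrary, repeatable image */
  for (i = 0; i < w * h; ++i)
    pixels[i] = rand() % 2;

  if (argc > 1)
    omp_set_num_threads(atoi(argv[1]));    /* e.g. "./bench 8"            */

  double start = omp_get_wtime();
  size_t *accum = hough(pixels, w, h, res, &rlimit);
  double elapsed = omp_get_wtime() - start;

  printf("%zu x %zu image, res %zu: %.3f s\n", w, h, res, elapsed);

  free(pixels);
  free(accum);
  return 0;
}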

False sharing

Note that the parallel implementations are susceptible to false sharing of the accumulator array. For the implementation that is based on decomposition of the output data, this false sharing can to a large extent be avoided by transposing the accumulator array (i.e., by organising it "distance-wise"). Doing so did, however, not result in any observable further speedup in my experiments.
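
In case you want to try the transposition yourself: organising the accumulator "distance-wise" simply means swapping the roles of the two indices so that all elements written for a given angle index t form a contiguous segment. A sketch of the changed innermost loop of the output-decomposition version follows (the rest of the function stays exactly as above, and any code that reads the accumulator must of course use the transposed indexing as well):

        /* Accumulator organised "distance-wise": the cells for a fixed
           angle index t occupy a contiguous block of *rlimit elements, so
           threads working on different angles no longer write to the same
           cache lines (except at block boundaries). */
        #pragma omp parallel for
        for (t = 0; t < res; ++t)
        {
          double theta = t * step;
          size_t r = x * cos(theta) + y * sin(theta);
          ++accum[t * *rlimit + r];
        }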

Conclusion

Returning to your question, "what is the best way to split the accumulator space?", the answer seems to be that it is best not to split the accumulator space at all, but instead split the input space.

If, for some reason, you have set your mind on splitting the accumulator space, you may consider changing the structure of the algorithm so that the outermost loops iterate over the Hough space and the inner loop over whichever is the smallest of the input picture's dimensions. That way, you can still derive a parallel implementation that incurs threading overhead only once and that comes free of data-synchronisation overhead. However, in that scheme, the trigonometry can no longer be conditional and so, in total, each loop iteration will have to do more work than in the scheme above.
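
For what it is worth, here is a sketch of one possible reading of such a restructuring. It is not part of the measurements above, it does not reproduce the accumulator of the original function bit for bit (the rounding now happens on an image coordinate rather than on the distance), and for numerical safety it chooses the coordinate to walk along per angle rather than strictly by the smaller image dimension. It does, however, show the intended loop structure: the parallelised outer loop owns its accumulator cells outright, so no synchronisation is needed.

size_t *hough_by_cell(bool *pixels, size_t w, size_t h,
                      size_t res, size_t *rlimit)
{
  *rlimit = (size_t)(sqrt(w * w + h * h));

  double step = M_PI_2 / res;
  size_t *accum = calloc(res * *rlimit, sizeof(size_t));
  size_t t, r, i;

  /* The outer loops run over the Hough space; every (t, r) cell is written
     by exactly one thread, so no atomics are needed. */
  #pragma omp parallel for private(r, i)
  for (t = 0; t < res; ++t)
  {
    double c = cos(t * step), s = sin(t * step);
    for (r = 0; r < *rlimit; ++r)
      if (s > c)                        /* walk along x and solve for y ... */
        for (i = 0; i < w; ++i)
        {
          double y = (r - i * c) / s;
          if (y >= 0 && y < h && pixels[(size_t)y * w + i])
            ++accum[r * res + t];
        }
      else                              /* ... or walk along y, solve for x */
        for (i = 0; i < h; ++i)
        {
          double x = (r - i * s) / c;
          if (x >= 0 && x < w && pixels[i * w + (size_t)x])
            ++accum[r * res + t];
        }
  }

  return accum;
}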