1 vote

What I want to do is feed in my m x n matrix and, in parallel, construct a square diagonal matrix from each of its n columns, perform an operation on each of those square diagonal matrices, and then recombine the results. How do I do this?

So far, I start off with an m x n matrix; the result of a previous matrix computation in which each element is calculated using the function y = f(g(x)).

This gives me a matrix with n columns [f1, f2, ..., fn], where each fn represents a column vector of height m.

From here, I want to differentiate each column of the matrix with respect to g(x). Differentiating fn(x) with respect to g(x) yields a square matrix of elements f'(x). Under the constraint that each element of fn depends only on the corresponding element of g(x), this square matrix reduces to a Jacobian with the elements of fn' along the diagonal and all other elements equal to zero.
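In symbols, the diagonal Jacobian described above looks like this (one per column fn, with m the column height):

\frac{\partial f_n}{\partial g} =
\begin{pmatrix}
f_n'(g_1) & 0         & \cdots & 0 \\
0         & f_n'(g_2) & \cdots & 0 \\
\vdots    &           & \ddots & \vdots \\
0         & 0         & \cdots & f_n'(g_m)
\end{pmatrix}
= \operatorname{diag}\bigl(f_n'(g_1), \dots, f_n'(g_m)\bigr)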

Hence the need to construct a diagonal matrix for each of the column vectors fn.

To do this, I take a target vector A (hA x 1), extracted from the larger m x n matrix, and prepare a zeroed matrix C (hA x hA) to hold the diagonal.

The aim is to diagonalize the vector A into a square matrix with each element of A sitting on the diagonal of C, everything else being zero.
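For example, with hA = 3:

A = \begin{pmatrix} a_1 \\ a_2 \\ a_3 \end{pmatrix}
\quad\longrightarrow\quad
C = \begin{pmatrix} a_1 & 0 & 0 \\ 0 & a_2 & 0 \\ 0 & 0 & a_3 \end{pmatrix}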

There are probably more efficient ways to accomplish this using some pre-built routine without building a whole new kernel, but please be aware that for these purposes, this method is necessary.

The kernel code (which works) to accomplish this is shown here:

_cudaDiagonalizeTest<<<5, 1>>>(d_A, matrix_size.uiWA, matrix_size.uiHA, d_C, matrix_size.uiWC, matrix_size.uiHC); // 5 blocks of 1 thread: one thread per element of the 5-element vector

__global__ void _cudaDiagonalizeTest(float *A, int wA, int hA, float *C, int wC, int hC)
{
    int ix, iy, idx;

    // global x/y coordinates of this thread
    ix = blockIdx.x * blockDim.x + threadIdx.x;
    iy = blockIdx.y * blockDim.y + threadIdx.y;

    // flat index into A (iy is always 0 with the 1D launch above, so idx == ix)
    idx = iy * wA + ix;

    // a stride of wC + 1 walks down the main diagonal of row-major C
    C[idx * (wC + 1)] = A[idx];
}
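For reference, a minimal host-side harness for this kernel might look like the sketch below. This is an assumption-laden sketch, not the original harness: hA = 5 is taken from the <<<5, 1>>> launch above, the matrix_size struct is replaced by plain ints, and h_A's values are illustrative.

#include <cuda_runtime.h>
#include <cstdio>

// the _cudaDiagonalizeTest kernel above is assumed to be in the same file

int main()
{
    const int hA = 5;            // height of the target vector A (assumed)
    const int wC = 5, hC = 5;    // C is an hA x hA square matrix

    float h_A[5] = { 1, 2, 3, 4, 5 };
    float h_C[25] = { 0 };

    float *d_A, *d_C;
    cudaMalloc(&d_A, sizeof(h_A));
    cudaMalloc(&d_C, sizeof(h_C));
    cudaMemcpy(d_A, h_A, sizeof(h_A), cudaMemcpyHostToDevice);
    cudaMemset(d_C, 0, sizeof(h_C));   // zero C so off-diagonal entries stay 0

    _cudaDiagonalizeTest<<<hA, 1>>>(d_A, 1, hA, d_C, wC, hC);

    cudaMemcpy(h_C, d_C, sizeof(h_C), cudaMemcpyDeviceToHost);
    for (int i = 0; i < hC; ++i)
    {
        for (int j = 0; j < wC; ++j)
            printf("%g\t", h_C[i * wC + j]);
        printf("\n");
    }
    return 0;
}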

I suspect that this is a very naive approach, and I was wondering if someone could give an example of how I could do the same using

a) reduction

b) thrust

For vectors with a large number of rows, I would like to use the GPU's multithreading capabilities to chunk the task into small jobs and combine the results at the end with __syncthreads().
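Note that __syncthreads() only synchronizes the threads within a single block; combining partial results across blocks requires a second kernel launch or atomics. For covering an m x n matrix with many small jobs, a typical launch configuration looks like the hypothetical sketch below (the 16 x 16 block shape and kernel name are assumptions, not from the code above):

// hypothetical 2D launch covering an m x n matrix, one thread per element
dim3 block(16, 16);
dim3 grid((n + block.x - 1) / block.x,   // ceil(n / 16) blocks across columns
          (m + block.y - 1) / block.y);  // ceil(m / 16) blocks down rows
// someKernel<<<grid, block>>>(...);     // each thread guards: if (ix < n && iy < m)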

The picture below shows what the desired result is.

I have read NVIDIA's article on reduction, but did not manage to achieve the desired results.

Any assistance or explanation would be very much welcomed.

[Image 1: the desired result]

Thanks.

[Image 2: Matrix A (5 x 4) and the desired diagonal layout in Matrix B (5 x 20)]

Matrix A is the target, with 4 columns. I want to take each column in turn and copy its elements into Matrix B as a diagonal.

I'm not sure I follow exactly what you're looking for. Could you please include a sample input and desired output? I don't see how reduction applies to this problem. - Christian Sarofeen
Are you really sure you need to do this at all? If you have a purely diagonal matrix, the best way to store and use it is as you already have it - as a diagonal array. You will use up a lot of memory, a lot of memory bandwidth and a lot of flops just loading and storing zeros, usually for no good reason - talonmies
The diagonalization of the aforementioned column vector is only a small step in a larger operation. What I am attempting is to diagonalize each column of an m x n matrix in parallel, perform computations with these n diagonalized square matrices (there are n columns in the m x n matrix, and hence n square matrices after diagonalizing each one), and then sum the results of the computation back together. All of this must be done in the kernel. Is there an efficient way of doing this? - guerrillacodester
Are you sure your sample kernel is really doing that then? Anyways, let me try again to guess what you mean: you want to copy A(x,y) to C(x,x+y), where M(x,y) means the entry of M in the x-th row and y-th column? (using 0-based indices) and... maybe filling in the rest of C with zeroes? Or maybe you mean to copy A(x,y) to C(x, (x+y) mod N), where N is the number of columns of C? - user1084944
Refer to the second image for a clearer picture. Matrix A is a smaller 5 x 4 matrix (5 rows, 4 columns). Matrix B is a larger 5 x 20 matrix (5 rows, 20 columns). I want to take each of the 4 columns of Matrix A (each one being a vector of height 5) and lay out its elements into a diagonal. Each diagonal construction would produce a 5 x 5 matrix which would then fit into the larger 5 x 20 Matrix B. I want to do this in parallel using CUDA. Please refer to the image posted at the bottom of the original post. - guerrillacodester

2 Answers

2 votes

I created a simple example based on thrust. It uses column-major order to store the matrices in a thrust::device_vector. It should scale well with larger row/column counts.

Another approach could be based on the thrust strided_range example; a rough sketch of that idea follows.
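Here is an untested sketch of that strided approach, assuming the same column-major A and B as in the full example below. The stride_index functor is hypothetical (it is not part of thrust); the diagonal of block c in B starts at linear offset c*rows*rows and advances with stride rows + 1. It additionally needs <thrust/copy.h> and <thrust/iterator/permutation_iterator.h>.

// hypothetical index functor: i-th diagonal entry of one block
struct stride_index : public thrust::unary_function<int,int>
{
  stride_index(int stride, int offset) : stride(stride), offset(offset) {}

  __host__ __device__
  int operator()(const int i) const { return offset + i * stride; }

  const int stride;
  const int offset;
};

// one copy per column: column c of A goes onto the diagonal of block c of B
for (int c = 0; c < cols; ++c)
{
  thrust::copy_n(
    A.begin() + c * rows, rows,
    thrust::make_permutation_iterator(
      B.begin(),
      thrust::make_transform_iterator(thrust::make_counting_iterator(0),
                                      stride_index(rows + 1, c * rows * rows))));
}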

The scatter-based example below does what you want (fill the diagonals based on the input vector). However, depending on how you use the resulting matrix in your "differentiating" step, it may still be worth investigating whether a sparse storage format (without all the zero entries) is possible, since that would reduce memory consumption and make iteration cheaper.

#include <thrust/device_vector.h>
#include <thrust/scatter.h>
#include <thrust/sequence.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/functional.h>
#include <iostream>


template<typename V>
void print_matrix(const V& mat, int rows, int cols)
{
   for(int i = 0; i < rows; ++i)
   {
     for(int j = 0; j < cols; ++j)
     {
      std::cout << mat[i + j*rows] << "\t";
     }
     std::cout << std::endl;
   }
}

// maps the i-th element of A (column-major) to its position in B:
// element i lands in row (i % rows) and column i of B, which is
// linear index i*rows + (i % rows) in column-major storage
struct diag_index : public thrust::unary_function<int,int>
{
  diag_index(int rows) : rows(rows){}

  __host__ __device__
  int operator()(const int index) const
  {
      return (index*rows + (index%rows));
  }

  const int rows;
};

int main()
{
  const int rows = 5; 
  const int cols = 4;

  // allocate memory and fill with demo data
  // we use column-major order
  thrust::device_vector<int> A(rows*cols);
  thrust::sequence(A.begin(), A.end());

  thrust::device_vector<int> B(rows*rows*cols, 0);

  // scatter each element of A to its diagonal position in B
  thrust::scatter(A.begin(), A.end(),
                  thrust::make_transform_iterator(thrust::make_counting_iterator(0),
                                                  diag_index(rows)),
                  B.begin());

  print_matrix(A, rows, cols);
  std::cout << std::endl;
  print_matrix(B, rows, rows*cols);
  return 0;
}

This example will output:

0    5    10    15    
1    6    11    16    
2    7    12    17    
3    8    13    18    
4    9    14    19    

0    0    0    0    0    5    0    0    0    0    10    0    0    0    0    15    0    0    0    0    
0    1    0    0    0    0    6    0    0    0    0    11    0    0    0    0    16    0    0    0    
0    0    2    0    0    0    0    7    0    0    0    0    12    0    0    0    0    17    0    0    
0    0    0    3    0    0    0    0    8    0    0    0    0    13    0    0    0    0    18    0    
0    0    0    0    4    0    0    0    0    9    0    0    0    0    14    0    0    0    0    19    
-1 votes

An alternate answer that does not use thrust is as follows:

_cudaMatrixTest<<<4, 5>>>(d_A, matrix_size.uiWA, matrix_size.uiHA, d_C, matrix_size.uiWC, matrix_size.uiHC); // one block per column of A, one thread per row: exactly wA x hA threads

__global__ void _cudaMatrixTest(float *A, int wA, int hA, float *C, int wC, int hC)
{
    // assumes a <<<wA, hA>>> launch (one block per column of A, one thread
    // per row) and that C stores the wA diagonal blocks of size hA x hA
    // contiguously, so that wC == hA
    int ix, iy, idx;

    ix = blockIdx.x * blockDim.x + threadIdx.x;
    iy = blockIdx.y * blockDim.y + threadIdx.y;

    idx = iy * wA + ix;              // flat element index, 0 .. wA*hA - 1

    if (idx >= wA * hA) return;      // guard any surplus threads

    // element (row ix % hA, column ix / hA) of row-major A lands on the
    // diagonal of block (ix / hA) of C
    C[idx * wC + (idx % wC)] = A[threadIdx.x * wA + (ix / wC)];
}

where d_A is

0    5    10    15    
1    6    11    16    
2    7    12    17    
3    8    13    18    
4    9    14    19    

Both answers are viable solutions. The question is: which is better/faster?
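One way to settle that is simply to time both versions on representative sizes. Below is a minimal timing sketch using CUDA events; it is a fragment that assumes the kernel, device pointers, and matrix_size fields from the answer above.

// minimal timing sketch using CUDA events (fragment; setup assumed from above)
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
_cudaMatrixTest<<<4, 5>>>(d_A, matrix_size.uiWA, matrix_size.uiHA,
                          d_C, matrix_size.uiWC, matrix_size.uiHC);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("kernel time: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);

For matrices as small as 5 x 4, launch overhead will dominate either approach, so any meaningful comparison should be run at realistic m and n.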