I have implemented a Matrix datatype in C++ backed by a 1D array that is wrapped into rows and columns. Now I want to be able to create square (blocked) sub-matrices from this matrix, and I want to do it in-memory.
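For reference, the layout is roughly like this (the names are illustrative, not my actual code):

```cpp
#include <cstddef>
#include <vector>

// Row-major matrix backed by a single contiguous 1D buffer.
struct Matrix {
    std::size_t rows, cols;
    std::vector<float> data;   // rows * cols elements, row-major

    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c) {}

    float&       operator()(std::size_t i, std::size_t j)       { return data[i * cols + j]; }
    const float& operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};
```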
The problem is that I want some of these sub-matrices to be transferable to GPU memory so they can be processed there in parallel; this is useful, for example, for matrix multiplication. Since the sub-matrices are not contiguous in main memory, copying one of them to device memory as a single unit seems impossible without creating a separate copy. I would like the GPU sub-matrix to map directly back to the original CPU matrix, both so updates can flow back and for efficiency. I also don't know the exact partitioning in advance.
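To make the issue concrete, here is a rough sketch of copying a single B×B block with a strided copy (using the Matrix sketch above; block size and names are just for illustration). `cudaMemcpy2D` can move the whole block in one call, but it still produces a detached device copy rather than a live mapping to the host matrix:

```cpp
#include <cuda_runtime.h>

// Sketch only: copy one B x B sub-block of a row-major host matrix to the
// device with a pitched copy. This creates a separate device copy; it does
// not keep the device block mapped to the original host matrix.
void copyBlockToDevice(const Matrix& A, std::size_t row0, std::size_t col0,
                       std::size_t B, float* d_block /* B*B floats on device */)
{
    const float* src = &A.data[row0 * A.cols + col0];    // top-left element of the block
    cudaMemcpy2D(d_block,                 // dst
                 B * sizeof(float),       // dst pitch: block rows packed tightly
                 src,                     // src
                 A.cols * sizeof(float),  // src pitch: full row stride of A
                 B * sizeof(float),       // width of one block row in bytes
                 B,                       // number of rows to copy
                 cudaMemcpyHostToDevice);
}
```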
Does anyone have an idea how I can achieve this?
Just a reminder: the matrix needs to be partitioned into blocks, not row-wise (which would be relatively easy in C/C++).