17
votes

Long story short, I have done several prototypes of interactive software. I use pygame now (python sdl wrapper) and everything is done on CPU. I am starting to port it to C now and at the same time search for the existing possibilities to use some GPU power to entlast the CPU from redundant operations. However I cannot find a good "guideline" what exact technology/tools should I pick in my situation. I just read plethora of docs, it drains my mental powers very fast. I am not sure if it is possible at all, so I'm puzzled.
Here I've made a very rough sketch of my typical application skeleton that I develop, but given that it uses GPU now (note, I have almost zero practical knowledge about GPU programming). Still important is that data types and functionality must be exactly preserved. Here it is:
enter image description here

So F(A,R,P) is some custom function, for example element substitution, repetition, etc. Function is presumably constant in program lifetime, rectangle's shapes generally are not equal with A shape, so it is not in-place calculation. So they are simply generated whith my functions. Examples of F: repeat rows and columns of A; substitute values with values from Substitution tables; compose some tiles into single array; any math function on A values, etc. As said all this can be easily made on CPU, but app must be really smooth. BTW in pure Python it became just unusable after adding several visual features, which are based on numpy arrays. Cython helps to make fast custom functions but then the source code is already kind of a salad.

Question:

  • Does this schema reflect some (standart) technology/dev.tools?

  • Is CUDA what I am looking for? If yes, some links/examples which coincides whith my application structure, would be great.

I realise, this a big question, so I will give more details if it helps.


Update

Here is a concrete example of two typical calculations for my prototype of bitmap editor. So the editor works with indexes and the data include layers with corresponding bit masks. I can determine the size of layers and masks are same size as layers and, say, all layers are same size (1024^2 pixels = 4 MB for 32 bit values). And my palette is say, 1024 elements (4 Kilobytes for 32 bpp format).
Consider I want to do two things now:

Step 1. I want to flatten all layers in one. Say A1 is default layer (background) and layers 'A2' and 'A3' have masks 'm2' and 'm3'. In python i'd write:

from numpy import logical_not
...
Result = (A1 * logical_not(m2) + A2 * m2) * logical_not(m3) + A3 * m3

Since the data is independent I believe it must give speedup proportionl to number of parallel blocks.

Step 2. Now I have an array and want to 'colorize' it with some palette, so it will be my lookup table. As I see now, there is a problem with simultanous read of lookup table element. enter image description here

But my idea is, probably one can just duplicate the palette for all blocks, so each block can read its own palette? Like this: enter image description here

3
FYI: There is Theano (Python) that can use the GPU. It compiles symbolic expressions to CUDA code that runs on your GPU.Stefan Falk
If there is a lot of independent data going through that (hundeds of MB/s), than it can really speed things up. Only problem I see is in "resulting rectangles can be arbitrary size". I've never heard/read about allocating through kernels. If it's size is known before start, it's ok, but arrays must be defined before GPU kernel starts. So it basically depends on F (both input and output) and size of Ablind.wolf
@blind.wolf Yes, the sizes are known, it just must not be of the same shape as A, for example in case of integer array upscaling.Mikhail V
Why would anyone vote to close this question?????barak manos
@buttifulbuttefly well, I think so. Imagine you have several arrays, and you want to get an average values, for example. Like layers in Photoshop, but only index values. And now move one layer around and see changing result in real time. But to start with, I wanted smth. very very simple, like in mentioned examples with substitution or row/column repetition. If I could make my array filled with palette values, it would alone be a nice improvement. And substitution can be perfectly parallelised, am I right?Mikhail V

3 Answers

2
votes

When your code is highly parallel (i.e. there are small or no data dependencies between stages of processing) then you can go for CUDA (more finegrained control over synching) or OpenCL (very similar AND portable OpenGL-like API to interface with the GPU for kernel processing). Most of the acceleration work we do happens in OpenCL, which has excellent interop with both OpenGL and DirectX, but we also have the same setup working with CUDA. One big difference between CUDA and OpenCL is that in CUDA you can compile kernels once and delay-load (and/or link) them in your app, whereas in OpenCL the compiler plays nice with the OpenCL driver stack to ensure the kernel is compiled when the app starts.

One alternative that is often overlooked if you're using Microsoft Visual Studio is C++AMP, a C++ syntax-friendly and intuitive api for those who do not want to dig into the logic twists and turns of OpenCL/CUDA API's. Big advantage here is that the code also works if you do not have a GPU in the system, but then you do not have as many options to tweak performance. Still, in a lot of cases, this is a fast and efficient way to write proof your concept code and re-implement bits and parts in CUDA or OpenCL later.

OpenMP and Thread Building Blocks are only good alternatives when you have synching issues and lots of data dependencies. Native threading using worker threads is also a viable solution, but only if you have a good idea on how synch-points can be set up between the different processes in such a way that threads do not starve each-other out when fighting for priority. This is a lot harder to get right, and tools such as Parallel Studio are a must. But then, so is NVida NSight if you're writing GPU code.

Appendix:

A new platform called Quasar (http://quasar.ugent.be/blog/) is being developed that enables you to write your math problems in a syntax that is very similar to Matlab, but with full support of c/c++/c# or java integration, and cross-compiles (LLVM, CLANG) your "kernel" code to any underlying hardware configuration. It generates CUDA ptx files, or runs on openCL, or even on your CPU using TBB's, or a mixture of them. Using a few monikers, you can decorate the algorithm so that the underlying compiler can infer types (you can also explicitly use strict typing), so you can leave the type-heavy stuff entirely up to the compiler. To be fair, at the time of writing, the system is still w.i.p. and the first OpenCL compiled programs are just being tested, but most important benefit is fast prototyping with almost identical performance compared to optimized cuda.

1
votes

There is not much difference between OpenCL/CUDA so choose which works better for you. Just remember that CUDA will limit you to the NVidia GPUs.

If i understand corretly to your problem, kernel (function executed on GPU) should be simple. It should follow this pseudocode:

kernel main(shared A, shared outA, const struct R, const struct P, const int maxOut, const int sizeA)
  int index := getIndex() // get offset in input array
  if(sizeA >= index) return // GPU often works better when n of threads is 2^n
  int outIndex := index*maxOut // to get offset in output array
  outA[outIndex] := F(A[index], R, P)
end

Functions F should be inlined and you can use switch or if for different function. Since there is not known size of the output of F, then you have to use more memory. Each kernel instance must know positions for correct memory writes and reads so there have to be some maximum size (if there is none, than this all is useless and you have to use CPU!). If different sizes are sparse, then I would use something like computing these different sizes after getting the array back to RAM and compute these few with CPU, while filling outA with some zeros or indication values.

Sizes of arrays are obviously length(A) * maxOut = length(outA).

I forgot to mention that if execution of F is not same in most of the cases (same source code), than GPU will serialize it. GPU multiprocessors have a few cores connected into the same instruction cache so it will have to serialize the code, which is not the same for all cores! OpenMP or threads are better choice for this kind of problem!

1
votes

What you want to do is send values really fast to the GPU using the high frequency dispatch and then display the result of a function which is basically texture lookups and some parameters.

I would say this problem will only be worth solving on the GPU if two conditions are met:

  1. The size of A[] is optimised to make the transfer times irrelevant (Look at, http://blog.theincredibleholk.org/blog/2012/11/29/a-look-at-gpu-memory-transfer/).

  2. The lookup table is not too big and/or the lookup values are organized in a way that the cache can be maximally utilized, in general random lookups on the GPU can be slow, ideally you can pre-load the R[] values in a shared memory buffer for each element of the A[] buffer.

If you can answer both of those questions positively then and only then consider having a go at using the GPU for your problem, else those 2 factors will overpower the computational speed-up that the GPU can provide you with.

Another thing you can have a look at is to as best as you can overlap the transfer and computing times to hide as much as possible the slow transfer rates of CPU->GPU data.

Regarding your F(A, R, P) function you need to make sure that you do not need to know the value of F(A, R, P)[0] in order to know what the value of F(A, R, P)[1] is because if you do then you need to rewrite F(A, R, P) to go around this issue, using some parallelization technique. If you have a limited number of F() functions then this can be solved by writing a parallel version of each F() function for the GPU to use, but if F() is user-defined then your problem becomes a bit trickier.

I hope this is enough information to have an informed guess towards whether you should or not use a GPU to solve your problem.

EDIT

Having read your edit, I would say yes. The palette could fit in shared memory (See GPU shared memory size is very small - what can I do about it?) which is very fast, if you have more than one palette, you could fit 16KB (size of shared mem on most cards) / 4KB per palette = 4 palettes per block of threads.

One last warning, integer operations are not the fastest on the GPU, consider using floating points if necessary after you have implemented your algorithm and it is working as a cheap optimization.