0
votes

OpenGL fragment shader: how much difference in computation time between working on "4 times of 1 channel" vs "1 time of 4 channels"?

For example, I could run the computation on 1 channel at a time and repeat it 4 times.

Or I could pack all the data into 4 channels and run the computation once.

Some things to consider: (a) the overhead of launching each fragment shader pass; (b) does a 1-channel texture fetch take almost the same time as a 4-channel texture fetch? Compared with one multiplication in the shader, how expensive is a texture fetch? If a texture fetch is cheap and there are many calculation steps (many multiplications, additions, etc.), then the texture fetch time does not matter much.

(c) how much difference in computation time is there between 4 evaluations of float a * float a and 1 evaluation of vec4(a, a, a, a) * vec4(a, a, a, a)?

I know for sure that "1 time of 4 channels" is faster than "4 times of 1 channel", but I want to know how much faster it is.

The reason I am considering "4 times of 1 channel" is that the whole implementation involves several passes. For example, one pass takes texture 1 as input and renders to texture 2, so two textures exist at the same time. Once texture 2 has been computed, texture 1 can be deleted. So each pass needs one extra texture in GPU memory. With 1 channel, that extra texture has 1 channel; with 4 channels, it has 4 channels. That is where the difference in memory use comes from. (This is just a simple example. The real implementation involves more steps.)
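To put rough numbers on the memory side of this, here is a small Python sketch of the size of that one extra render target. It assumes uncompressed 32-bit float channels with no mipmaps, padding, or driver overhead; the function name `texture_bytes` and the 1024x1024 size are only illustrative and not taken from the real implementation.

```python
def texture_bytes(width, height, channels, bytes_per_channel=4):
    """Approximate size of an uncompressed texture (no mipmaps, no padding)."""
    return width * height * channels * bytes_per_channel


# Hypothetical 1024x1024 render target used as the "extra" texture in one pass.
extra_one_channel = texture_bytes(1024, 1024, channels=1)
extra_four_channels = texture_bytes(1024, 1024, channels=4)

# Report both sizes in MiB (1 MiB = 2**20 bytes).
print(extra_one_channel // 2**20, "MiB vs", extra_four_channels // 2**20, "MiB")
```

Under these assumptions the extra texture is 4 times larger in the 4-channel case. Real drivers may pad or align allocations, so the actual numbers should be measured.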

I want to balance the trade-off between GPU memory and GPU computation time.

Any ideas or resources on these questions?

2
"I know for sure that '1 time of 4 channels' is faster than '4 times of 1 channel', but I want to know how much faster it is." I would like to interject that you can use a texture gather to get 4 single-channel texels (the neighbors used for linear texture filtering) with the same amount of work as 1 four-channel texel. Whether you can use that to your advantage performance-wise, I could not say, but there are some valid uses for that approach on DX11-class hardware. - Andon M. Coleman
D3D's HLSL language reference does a much better job of explaining how that feature works, if you are interested in it. - Andon M. Coleman

2 Answers

0
votes

This is not quite straightforward and depends on your use case. If your input data is interleaved, i.e. normal RGBA, then processing all 4 channels at once in a single pass is most likely better. If your data is interleaved and you process one channel at a time, you perform the same amount of calculation but at 4 times the memory access cost. The reason is that even though you use only one channel, all 4 are still loaded and 3 of them are then discarded. If your data is separated by channel, i.e. an array of all R values, then an array of all G values, and so on, then processing 1 channel at a time is better, because you still go over your data only once in total.
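The memory-access argument above can be sketched as simple arithmetic in Python. This is only a back-of-the-envelope model: it assumes every fetch brings in all channels stored in a texel, and it ignores caches, compression, and filtering. The function name `bytes_fetched` and the texel count are made up for the illustration.

```python
def bytes_fetched(num_texels, channels_stored, passes, bytes_per_channel=4):
    # Assumption: a fetch loads every channel stored in the texel,
    # even if the shader uses only one of them.
    return num_texels * channels_stored * bytes_per_channel * passes


n = 1024 * 1024  # hypothetical number of texels

# Interleaved RGBA, all 4 channels processed in one pass.
interleaved_one_pass = bytes_fetched(n, channels_stored=4, passes=1)
# Interleaved RGBA, one channel used per pass, 4 passes.
interleaved_four_passes = bytes_fetched(n, channels_stored=4, passes=4)
# Four separate single-channel textures, one pass over each.
planar_four_passes = bytes_fetched(n, channels_stored=1, passes=4)

print(interleaved_four_passes // interleaved_one_pass)  # 4
print(planar_four_passes == interleaved_one_pass)       # True
```

In this model, channel-at-a-time processing costs 4 times the traffic on interleaved data but no extra traffic on channel-separated data, which matches the reasoning above. Actual hardware behavior depends on caching and texture layout.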

In the end, look at how your data is organized, then test and measure.

0
votes

I think I found a partial answer.

In terms of computation time, 1 evaluation of float a * float a and 1 evaluation of vec4(a, a, a, a) * vec4(a, a, a, a) should cost the same, according to "Chapter 35. GPU Program Optimization" from GPU Gems 2.

So we should use vec4 operations as much as possible.