2
votes

I am trying to convert an existing OpenCL kernel to an HLSL compute shader.

The OpenCL kernel samples each pixel in an RGBA texture and writes each color channel to a tightly packed array.

So basically, I need to write to a tightly packed uchar array in a pattern that goes somewhat like this:

r r r ... r g g g ... g b b b ... b a a a ... a

where each letter stands for a single byte (red / green / blue / alpha) that originates from a pixel channel.
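To make the target layout concrete, here is a minimal CPU-side sketch in C++ of the planar pattern described above. The helper names `planar_offset` and `pack_planar` are hypothetical, and the interleaved 8-bit RGBA input is an assumption made only for illustration; the real source is a texture sampled on the GPU.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: byte offset of one channel of one pixel in the
// planar layout "r r r ... g g g ... b b b ... a a a".
// channel: 0 = red, 1 = green, 2 = blue, 3 = alpha.
inline std::size_t planar_offset(std::size_t channel, std::size_t x, std::size_t y,
                                 std::size_t width, std::size_t height) {
    return channel * (width * height) + y * width + x;
}

// CPU reference: converts interleaved RGBA bytes (assumed input format)
// into the tightly packed planar array.
std::vector<std::uint8_t> pack_planar(const std::vector<std::uint8_t>& rgba,
                                      std::size_t width, std::size_t height) {
    std::vector<std::uint8_t> out(width * height * 4);
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            for (std::size_t c = 0; c < 4; ++c) {
                out[planar_offset(c, x, y, width, height)] =
                    rgba[(y * width + x) * 4 + c];
            }
        }
    }
    return out;
}
```

Each output byte lands at its own byte offset, so neighbouring pixels of the same channel share a 4-byte word. That is why a store restricted to 4-byte-aligned addresses cannot express this pattern directly.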

The documentation for the RWByteAddressBuffer Store method clearly states:

void Store(
  in uint address,
  in uint value
);

address [in]

Type: uint

The input address in bytes, which must be a multiple of 4.

To write the correct pattern to the buffer, I must be able to write a single byte to a non-aligned address. In OpenCL / CUDA this is pretty trivial.

  • Is it technically possible to achieve that with HLSL?
  • Is this a known limitation? Are there possible workarounds?
1 Answer

3
votes

As far as I know, it is not possible to write directly to a non-aligned address in this scenario. You can, however, use a little trick to achieve what you want. Below is the code of the entire compute shader, which writes the packed layout you describe. The function StoreValueAtByte in particular is what you are looking for.

Texture2D<float4> Input;
RWByteAddressBuffer Output;

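void StoreValueAtByte(in uint index_of_byte, in uint value) {

    // Calculate the address of the 4-byte-slot in which index_of_byte resides.
    // Integer masking avoids the precision loss a float round trip would
    // cause for indices above 2^24.
    uint addr_align4 = index_of_byte & ~3u;

    // Calculate which byte within the 4-byte-slot it is
    uint location = index_of_byte % 4;

    // Keep only the low byte and shift it to its location within the slot.
    // Byte 0 is the least significant byte, assuming the buffer is read
    // back as little-endian memory.
    value = (value & 0xFF) << (location * 8);

    // Atomically OR the byte into the buffer (the slot must start out as zero)
    Output.InterlockedOr(addr_align4, value);
}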

[numthreads(20, 20, 1)]
void CSMAIN(uint3 ID : SV_DispatchThreadID) {

    // Get width and height of texture
    uint tex_width, tex_height;
    Input.GetDimensions(tex_width, tex_height);

    // Make sure thread does not operate outside the texture
    if(tex_width > ID.x && tex_height > ID.y) {

        uint num_pixels = tex_width * tex_height;

        // Calculate address of where to write color channel data of pixel
        uint addr_red = 0 * num_pixels + ID.y * tex_width + ID.x;
        uint addr_green = 1 * num_pixels + ID.y * tex_width + ID.x;
        uint addr_blue = 2 * num_pixels + ID.y * tex_width + ID.x;
        uint addr_alpha = 3 * num_pixels + ID.y * tex_width + ID.x;

        // Get color of pixel and convert from [0,1] to [0,255]
        float4 color = Input[ID.xy];
        uint4 color_final = uint4(round(saturate(color) * 255.0f));

        // Store color channel values in output buffer
        StoreValueAtByte(addr_red, color_final.x);
        StoreValueAtByte(addr_green, color_final.y);
        StoreValueAtByte(addr_blue, color_final.z);
        StoreValueAtByte(addr_alpha, color_final.w);
    }
}

I hope the code is self-explanatory, but here is a walkthrough anyway.
The first thing the function StoreValueAtByte does is calculate the address of the 4-byte slot enclosing the byte you want to write to. It then calculates the position of that byte inside the slot (is it the first, second, third or fourth byte). Since the byte you want to write already sits in a 4-byte variable (namely value) and occupies its least significant byte, you just have to shift it to its proper position inside that variable. Finally, the variable is written to the buffer at the 4-byte-aligned address. This is done with an atomic bitwise OR because several threads write to the same 4-byte address; a plain Store from one thread would overwrite the bytes written by the others (a write-after-write hazard). This of course only works if you initialize the entire output buffer with zeros before issuing the dispatch call.
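The walkthrough above can be checked on the CPU. The following is a minimal sketch in C++, not the shader itself: the `ByteBuffer` type is hypothetical and stands in for a zero-initialized RWByteAddressBuffer, and `std::atomic::fetch_or` stands in for InterlockedOr. It shifts by `lane * 8` on the assumption that the buffer is read back as little-endian memory (byte 0 is the least significant byte of each word); with the opposite byte order the shift would be `(3 - lane) * 8`.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a RWByteAddressBuffer: an array of 32-bit words
// that only supports word-aligned atomic access.
struct ByteBuffer {
    std::vector<std::atomic<std::uint32_t>> words;

    explicit ByteBuffer(std::size_t num_bytes) : words((num_bytes + 3) / 4) {
        // The OR trick only works if every word starts out as zero.
        for (auto& w : words) w.store(0);
    }

    // Mirrors StoreValueAtByte: find the enclosing word, find the byte lane
    // inside it, shift the byte into that lane, and OR it in atomically.
    void store_value_at_byte(std::uint32_t index_of_byte, std::uint32_t value) {
        std::uint32_t word = index_of_byte / 4;  // aligned address is word * 4
        std::uint32_t lane = index_of_byte % 4;  // 0..3 within the word
        words[word].fetch_or((value & 0xFFu) << (lane * 8));
    }

    // Reads the buffer back as bytes, modelling a little-endian readback.
    std::vector<std::uint8_t> bytes(std::size_t num_bytes) const {
        std::vector<std::uint8_t> out(num_bytes);
        for (std::size_t i = 0; i < num_bytes; ++i) {
            out[i] = static_cast<std::uint8_t>(words[i / 4].load() >> ((i % 4) * 8));
        }
        return out;
    }
};
```

Because the write is an OR rather than an assignment, writing to a byte that is not zero merges the old and new bits instead of replacing them, which is why the buffer has to be cleared before every dispatch that reuses it.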