
I am trying to compile a simple program that uses __m128i with CUDA, but when I compile it with nvcc (nvcc test.cu -o test) on Linux, I get the error: "__m128i" is a vector, which is not supported in device code. This is the program I am trying to compile:

#include <stdio.h>
#include <emmintrin.h>

__global__ void hello(){
    printf("%d\n",threadIdx.x);
    __m128i x;

}
int main(){
   hello<<<3,3>>>();
}

When I type nvcc --version, I get Cuda compilation tools, release 10.2, V10.2.89

I actually faced this problem on a larger scale while trying to port some C++ code that uses __m128i to CUDA; what I have shown is a minimal version of the problem. So I am wondering whether there is a way to use __m128i in a CUDA kernel, or some other alternative. Thanks

@TedLyngmo: That post talks about GNU C __uint128_t, which is totally unrelated to __m128i other than having the same size. An SSE integer vector isn't a 128-bit integer type; the widest element size for arithmetic is 64-bit, e.g. _mm_add_epi64. (Unless you're only using bitwise boolean operations, in which case element boundaries don't matter.) - Peter Cordes
@TedLyngmo: But __m128i isn't a 128-bit integer; it's a SIMD vector. In GNU C, defined as typedef long long __m128i __attribute__((vector_size(16), may_alias)). Having a scalar 128-bit integer type supported by CUDA wouldn't help you compile code that uses __m128i with intrinsics like _mm_shuffle_epi32, _mm_add_epi32, and so on (treating it as a vector of 4x 32-bit integers), or _mm_minpos_epu16 (horizontal min and min-position of 16-bit unsigned elements), or other SSE hardware operations. You can't use __m128i as a single 128-bit integer, so that's not what the OP wants. - Peter Cordes
@PeterCordes Ah ... now I get what you're saying. :) Sorry for the confusion. Removing comment. - Ted Lyngmo

1 Answer


I am wondering if there is a way to use __m128i in a CUDA kernel ...

There is not. CUDA has native 128-bit integer types that meet the same alignment requirements as __m128i, but host SIMD vector types such as __m128i are not supported in device code.

or some other alternative

As noted above, there are 16-byte-aligned types which can be used to load and store data, but there is no native 128-bit SIMD intrinsic support in NVIDIA GPUs. The SIMD instructions which do exist operate within 32-bit registers.
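As a minimal sketch of that load/store approach (this kernel is illustrative, not from the answer): the built-in int4 type is 16-byte aligned, so a[i] compiles to a single 128-bit memory transaction, while the arithmetic remains four independent 32-bit operations.

```cuda
#include <cstdio>

// Sketch: each thread handles one int4 (16 bytes). The loads and the
// store are single 128-bit transactions thanks to int4's 16-byte
// alignment, but the adds are four separate 32-bit ops, not SIMD.
__global__ void add_int4(const int4* a, const int4* b, int4* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int4 va = a[i];   // one 16-byte aligned load
        int4 vb = b[i];
        out[i] = make_int4(va.x + vb.x, va.y + vb.y,
                           va.z + vb.z, va.w + vb.w);
    }
}
```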


CPU SIMD is done with short vectors like the 128-bit __m128i. GPU SIMD is done across the threads of a warp, and it is not usually software-visible in the way CPU SIMD is: you write scalar code, and each thread of the warp executes it in lockstep.

Code manually vectorized with __m128i can't be compiled for a GPU. If it has a scalar fallback version, use that instead, e.g. build with __SSE2__ undefined (#undef __SSE2__).

(CUDA's SIMD-within-a-32-bit-word intrinsics let you get more use out of the 32-bit-wide ALUs in each GPU execution unit if you have narrow data, like pairs of 16-bit integers or 4x 8-bit integers. So if your SSE intrinsics code uses _mm_add_epi8, you might still benefit from manual vectorization in CUDA, with its 4x 8-bit operations per instruction instead of SSE's 16x 8-bit.)