With your help, I have used SSE in my code (sample below) and got a significant performance boost. I was wondering whether this boost could be improved further by using the 256-bit registers of AVX.
int result[4] __attribute__((aligned(16))) = {0};
__m128i vresult = _mm_set1_epi32(0);
__m128i v1, v2, vmax;
for (int k = 0; k < limit; k += 4) {
    v1 = _mm_load_si128((__m128i *) &myVector[positionNodeId + k]);
    v2 = _mm_load_si128((__m128i *) &myVector2[k]);
    vmax = _mm_add_epi32(v1, v2);
    vresult = _mm_max_epi32(vresult, vmax);
}
_mm_store_si128((__m128i *) result, vresult);
return max(max(max(result[0], result[1]), result[2]), result[3]);
So, I have three questions: How could the rather simple SSE code above be converted to AVX? What header should I include for that? And what flag should I pass to gcc (instead of -msse4.1) for AVX to work?
Thanks in advance for your help.
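
For reference, here is a minimal sketch of what a 256-bit version of the loop might look like. Two points to keep in mind: the integer intrinsics used here (_mm256_add_epi32, _mm256_max_epi32) belong to AVX2 rather than plain AVX, and they are declared in <immintrin.h>. With gcc the usual flag is -mavx2; the sketch instead uses a per-function target attribute plus a runtime check so that it also builds without that flag. The names maxSum, maxSumAvx2 and maxSumScalar are hypothetical, the pointers a and b stand in for &myVector[positionNodeId] and myVector2, and limit is assumed to be a multiple of 8. Unaligned loads are used because nothing in the snippet guarantees 32-byte alignment.

```cpp
#include <immintrin.h>  // SSE/AVX/AVX2 intrinsics
#include <algorithm>    // std::max, std::max_element

// Scalar fallback (hypothetical helper). Starts from 0, like the
// zero-initialised vresult in the SSE snippet.
int maxSumScalar(const int *a, const int *b, int limit) {
    int best = 0;
    for (int k = 0; k < limit; ++k)
        best = std::max(best, a[k] + b[k]);
    return best;
}

// AVX2 version (hypothetical helper). Assumes limit is a multiple of 8.
// The target attribute lets gcc/clang compile this one function for AVX2
// even when -mavx2 is not passed on the command line.
__attribute__((target("avx2")))
int maxSumAvx2(const int *a, const int *b, int limit) {
    __m256i vresult = _mm256_setzero_si256();
    for (int k = 0; k < limit; k += 8) {
        // loadu: no 32-byte alignment requirement on the inputs
        __m256i v1 = _mm256_loadu_si256((const __m256i *)(a + k));
        __m256i v2 = _mm256_loadu_si256((const __m256i *)(b + k));
        vresult = _mm256_max_epi32(vresult, _mm256_add_epi32(v1, v2));
    }
    // Spill the 8 lanes to memory and reduce them to a single maximum.
    alignas(32) int result[8];
    _mm256_store_si256((__m256i *)result, vresult);
    return *std::max_element(result, result + 8);
}

// Picks the AVX2 path only if the CPU reports support for it.
int maxSum(const int *a, const int *b, int limit) {
    if (__builtin_cpu_supports("avx2"))
        return maxSumAvx2(a, b, limit);
    return maxSumScalar(a, b, limit);
}
```

If the whole file is compiled with -mavx2 and the target machine is known to support AVX2, the attribute and the runtime check can be dropped and maxSumAvx2 called directly. Whether this is actually faster than the SSE loop depends on the data size and on memory bandwidth, so it is worth measuring.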