I'm aware of the documented penalty for switching from AVX instructions to SSE instructions without first zeroing out the upper halves of the ymm registers, but on my particular machine (i7-3930K, 3.2 GHz) there seems to be a very large penalty for going the other way around (SSE to AVX), even though I explicitly call _mm256_zeroupper() before and after the AVX code section.
I have written functions for converting between 32-bit floats and 32-bit fixed-point integers, operating on two buffers of 32768 elements each. I ported an SSE2 intrinsic version directly to AVX to process 8 elements at a time instead of SSE's 4, expecting a significant performance increase, but unfortunately the opposite happened.
So, I have 2 functions:
void ConvertPcm32FloatToPcm32Fixed(int32* outBuffer, const float* inBuffer, uint sampleCount, bool bUseAvx)
{
    const float fScale = (float)(1U<<31);

    if (bUseAvx)
    {
        _mm256_zeroupper();
        const __m256 vScale = _mm256_set1_ps(fScale);
        const __m256 vVolMax = _mm256_set1_ps(fScale-1);
        const __m256 vVolMin = _mm256_set1_ps(-fScale);
        for (uint i = 0; i < sampleCount; i+=8)
        {
            const __m256 vIn0 = _mm256_load_ps(inBuffer+i); // Aligned load
            const __m256 vVal0 = _mm256_mul_ps(vIn0, vScale);
            const __m256 vClamped0 = _mm256_min_ps( _mm256_max_ps(vVal0, vVolMin), vVolMax );
            const __m256i vFinal0 = _mm256_cvtps_epi32(vClamped0);
            _mm256_store_si256((__m256i*)(outBuffer+i), vFinal0); // Aligned store
        }
        _mm256_zeroupper();
    }
    else
    {
        const __m128 vScale = _mm_set1_ps(fScale);
        const __m128 vVolMax = _mm_set1_ps(fScale-1);
        const __m128 vVolMin = _mm_set1_ps(-fScale);
        for (uint i = 0; i < sampleCount; i+=4)
        {
            const __m128 vIn0 = _mm_load_ps(inBuffer+i); // Aligned load
            const __m128 vVal0 = _mm_mul_ps(vIn0, vScale);
            const __m128 vClamped0 = _mm_min_ps( _mm_max_ps(vVal0, vVolMin), vVolMax );
            const __m128i vFinal0 = _mm_cvtps_epi32(vClamped0);
            _mm_store_si128((__m128i*)(outBuffer+i), vFinal0); // Aligned store
        }
    }
}
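(For clarity, here is a plain scalar sketch of what that loop computes - not the code I'm timing, just a reference; lrintf is used to mirror the round-to-nearest behaviour of cvtps_epi32 under the default rounding mode.)

#include <math.h>

void ConvertPcm32FloatToPcm32FixedScalar(int32* outBuffer, const float* inBuffer, uint sampleCount)
{
    const float fScale = (float)(1U<<31);
    for (uint i = 0; i < sampleCount; ++i)
    {
        // Scale to fixed point, clamp to the valid range, then round to int32.
        float v = inBuffer[i] * fScale;
        if (v > fScale - 1) v = fScale - 1;
        if (v < -fScale)    v = -fScale;
        outBuffer[i] = (int32)lrintf(v);
    }
}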
void ConvertPcm32FixedToPcm32Float(float* outBuffer, const int32* inBuffer, uint sampleCount, bool bUseAvx)
{
    const float fScale = (float)(1U<<31);

    if (bUseAvx)
    {
        _mm256_zeroupper();
        const __m256 vScale = _mm256_set1_ps(1/fScale);
        for (uint i = 0; i < sampleCount; i+=8)
        {
            __m256i vIn0 = _mm256_load_si256(reinterpret_cast<const __m256i*>(inBuffer+i)); // Aligned load
            __m256 vVal0 = _mm256_cvtepi32_ps(vIn0);
            vVal0 = _mm256_mul_ps(vVal0, vScale);
            _mm256_store_ps(outBuffer+i, vVal0); // Aligned store
        }
        _mm256_zeroupper();
    }
    else
    {
        const __m128 vScale = _mm_set1_ps(1/fScale);
        for (uint i = 0; i < sampleCount; i+=4)
        {
            __m128i vIn0 = _mm_load_si128(reinterpret_cast<const __m128i*>(inBuffer+i)); // Aligned load
            __m128 vVal0 = _mm_cvtepi32_ps(vIn0);
            vVal0 = _mm_mul_ps(vVal0, vScale);
            _mm_store_ps(outBuffer+i, vVal0); // Aligned store
        }
    }
}
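The buffers themselves are 32-byte aligned, since the aligned 256-bit loads/stores above require that (16 bytes would only be enough for the SSE path). A minimal allocation/round-trip sketch along these lines, assuming the _mm_malloc/_mm_free helpers are available and using the same int32/uint typedefs as above (the buffer and function names here are just for illustration):

#include <immintrin.h>

void RoundTripExample(bool bUseAvx)
{
    const uint kSampleCount = 32768;

    // _mm_malloc gives the 32-byte alignment the aligned AVX loads/stores need.
    float* floatBuf = (float*)_mm_malloc(kSampleCount * sizeof(float), 32);
    int32* fixedBuf = (int32*)_mm_malloc(kSampleCount * sizeof(int32), 32);

    // Fill with an arbitrary test signal in [0,1) so we aren't converting garbage.
    for (uint i = 0; i < kSampleCount; ++i)
        floatBuf[i] = (float)i / (float)kSampleCount;

    ConvertPcm32FloatToPcm32Fixed(fixedBuf, floatBuf, kSampleCount, bUseAvx);
    ConvertPcm32FixedToPcm32Float(floatBuf, fixedBuf, kSampleCount, bUseAvx);

    _mm_free(fixedBuf);
    _mm_free(floatBuf);
}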
So I start a timer, run ConvertPcm32FloatToPcm32Fixed, then run ConvertPcm32FixedToPcm32Float to convert straight back, and stop the timer. The SSE2 versions of the functions execute in 15-16 microseconds total, but the AVX versions take 22-23 microseconds.

A bit perplexed, I dug further and found a way to make the AVX versions faster than the SSE2 ones, but it feels like cheating: I run ConvertPcm32FloatToPcm32Fixed once before starting the timer, then start the timer, run ConvertPcm32FloatToPcm32Fixed again followed by ConvertPcm32FixedToPcm32Float, and stop the timer. As if there were a massive penalty for going from SSE to AVX, "priming" the AVX version with a trial run drops its execution time to 12 microseconds, while doing the same with the SSE equivalents only shaves off about a microsecond, down to 14. That makes AVX the marginal winner here, but only if I cheat. I considered that maybe AVX doesn't play as nicely with the cache as SSE, but using _mm_prefetch does nothing to help either.
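To illustrate the measurement: the timer I use isn't shown above, so std::chrono here is just a stand-in, but the benchmark is structured roughly like this, with the commented-out line being the "priming" call:

#include <chrono>

double TimeRoundTripMicroseconds(int32* fixedBuf, float* floatBuf, uint sampleCount, bool bUseAvx)
{
    // Uncommenting this "priming" run is what drops the AVX time from ~22 us to ~12 us:
    // ConvertPcm32FloatToPcm32Fixed(fixedBuf, floatBuf, sampleCount, bUseAvx);

    const auto start = std::chrono::high_resolution_clock::now();
    ConvertPcm32FloatToPcm32Fixed(fixedBuf, floatBuf, sampleCount, bUseAvx);
    ConvertPcm32FixedToPcm32Float(floatBuf, fixedBuf, sampleCount, bUseAvx);
    const auto end = std::chrono::high_resolution_clock::now();

    return std::chrono::duration<double, std::micro>(end - start).count();
}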
Am I missing something here?
Is your AVX build actually emitting VMULPS rather than MULPS - is the above code what you're actually using for your tests, or does the real code have separate modules compiled with/without -mavx? – Paul R