4
votes

I'm aware of the existing penalty for switching from AVX instructions to SSE instructions without first zeroing out the upper halves of all ymm registers, but in my particular case on my machine (i7-3939K 3.2GHz), there seems to be a very large penalty for going the other way around (SSE to AVX), even if I do explicitly use _mm256_zeroupper before and after the AVX code section.

I have written functions for converting between 32 bit floats and 32 bit fixed point integers, on 2 buffers that are 32768 elements wide. I ported an SSE2 intrinsic version directly to AVX to do 8 elements at once over SSE's 4, expecting to see a significant performance increase, but unfortunately, the opposite happened.

So, I have 2 functions:

void ConvertPcm32FloatToPcm32Fixed(int32* outBuffer, const float* inBuffer, uint sampleCount, bool bUseAvx)
{
    const float fScale = (float)(1U<<31);

    if (bUseAvx)
    {
        _mm256_zeroupper();
        const __m256 vScale = _mm256_set1_ps(fScale);
        const __m256 vVolMax = _mm256_set1_ps(fScale-1);
        const __m256 vVolMin = _mm256_set1_ps(-fScale);

        for (uint i = 0; i < sampleCount; i+=8)
        {
            const __m256 vIn0 = _mm256_load_ps(inBuffer+i); // Aligned load
            const __m256 vVal0 = _mm256_mul_ps(vIn0, vScale);
            const __m256 vClamped0 = _mm256_min_ps( _mm256_max_ps(vVal0, vVolMin), vVolMax );
            const __m256i vFinal0 = _mm256_cvtps_epi32(vClamped0);
            _mm256_store_si256((__m256i*)(outBuffer+i), vFinal0); // Aligned store
        }
        _mm256_zeroupper();
    }
    else
    {
        const __m128 vScale = _mm_set1_ps(fScale);
        const __m128 vVolMax = _mm_set1_ps(fScale-1);
        const __m128 vVolMin = _mm_set1_ps(-fScale);

        for (uint i = 0; i < sampleCount; i+=4)
        {
            const __m128 vIn0 = _mm_load_ps(inBuffer+i); // Aligned load
            const __m128 vVal0 = _mm_mul_ps(vIn0, vScale);
            const __m128 vClamped0 = _mm_min_ps( _mm_max_ps(vVal0, vVolMin), vVolMax );
            const __m128i vFinal0 = _mm_cvtps_epi32(vClamped0);
            _mm_store_si128((__m128i*)(outBuffer+i), vFinal0); // Aligned store
        }
    }
}

void ConvertPcm32FixedToPcm32Float(float* outBuffer, const int32* inBuffer, uint sampleCount, bool bUseAvx)
{
    const float fScale = (float)(1U<<31);

    if (bUseAvx)
    {
        _mm256_zeroupper();
        const __m256 vScale = _mm256_set1_ps(1/fScale);

        for (uint i = 0; i < sampleCount; i+=8)
        {
            __m256i vIn0 = _mm256_load_si256(reinterpret_cast<const __m256i*>(inBuffer+i)); // Aligned load
            __m256 vVal0 = _mm256_cvtepi32_ps(vIn0);
            vVal0 = _mm256_mul_ps(vVal0, vScale);
            _mm256_store_ps(outBuffer+i, vVal0); // Aligned store
        }
        _mm256_zeroupper();
    }
    else
    {
        const __m128 vScale = _mm_set1_ps(1/fScale);

        for (uint i = 0; i < sampleCount; i+=4)
        {
            __m128i vIn0 = _mm_load_si128(reinterpret_cast<const __m128i*>(inBuffer+i)); // Aligned load
            __m128 vVal0 = _mm_cvtepi32_ps(vIn0);
            vVal0 = _mm_mul_ps(vVal0, vScale);
            _mm_store_ps(outBuffer+i, vVal0); // Aligned store
        }
    }
}

So I start a timer, run ConvertPcm32FloatToPcm32Fixed then ConvertPcm32FixedToPcm32Float to convert straight back, end the timer. The SSE2 versions of the functions execute for a total of 15-16 microseconds, but the AVX versions take 22-23 microseconds. A bit perplexed, I dug a bit further, and I have discovered how to speed up the AVX versions so that they go faster than the SSE2 versions, but it's cheating. I simply run ConvertPcm32FloatToPcm32Fixed before starting the timer, then start the timer, and run ConvertPcm32FloatToPcm32Fixed again, then ConvertPcm32FixedToPcm32Float, stop the timer. As if there's a massive penalty for SSE to AVX, if I "prime" the AVX version first with a trial run, the AVX execution time drops to 12 microseconds, while doing the same thing with the SSE equivalents only drops the time down by a microsecond to 14, making AVX the marginal winner here, but only if I cheat. I considered that maybe AVX doesn't play as nicely with the cache as SSE, but using _mm_prefetch does nothing to help it either.

Am I missing something here?

2
Can you provide an SSCCE?Mysticial
For the SSE code are you using old SSE (destructive) or new SSE (non-destructive) instructions ? My understanding is that the AVX-SSE switching penalty applies only to the former ?Paul R
Old SSE destructive actually. But from reading documentation, it shouldn't matter anyway, because I'm getting hurt when going to AVX rather than from AVX.Kumputer
C++ code has been added. It seems there really is a penalty for switching to AVX, as I just tested it by running the SSE2 version before the timer to cache the buffers, then running the AVX ones after, but the performance penalty is still there.Kumputer
If you're compiling the SSE code with -mavx then it should be using the new (non-destructive) SSE instructions, e.g. VMULPS rather than MULPS - is the above code what you're actually using for your tests or does the real code have separate modules compiled with/without -mavx ?Paul R

2 Answers

5
votes

I did not test your code, but since your test appears quite short, maybe you're seeing the Floating point warm-up effect that Agner Fog discusses on p.101 of his microarchitecture manual (this applies to Sandy Bridge architecture). I quote:

The processor is in a cold state when it has not seen any floating point instructions for a while. The latency for 256-bit vector additions and multiplications is initially two clocks longer than the ideal number, then one clock longer, and after several hundred floating point instructions the processor goes to the warm state where latencies are 3 and 5 clocks respectively. The throughput is half the ideal value for 256-bit vector operations in the cold state. 128-bit vector operations are less affected by this warm-up effect. The latency of 128-bit vector additions and multiplications is at most one clock cycle longer than the ideal value, and the throughput is not reduced in the cold state.

2
votes

I was under the impression that unless the compiler encodes the SSE instructions using the VEX instruction format, as Paul R said - vmulps instead of mulps, the hit is massive.

When optimizing small segments, I tend to use this nice tool from Intel in tandem with some good ol' benchmarks

https://software.intel.com/en-us/articles/intel-architecture-code-analyzer

The report generated by IACA includes this notation:

"@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected"