I'm trying to use System.Numerics.Vector<T> to vectorize an algorithm and take advantage of the CPU's SIMD operations. However, my vector implementation was substantially slower than my original implementation. Is there any trick to using Vectors that may not have been documented? The specific use here is to try to speed up XORs over kilobytes of data.
Unfortunately, almost all of the documentation I can find on it is based on a pre-release version of RyuJIT, and I don't know how much of that material carries over to .NET Native.
When I inspect the disassembly during a Vector xor operation, it shows:
00007FFB040A9C10 xor eax,eax
00007FFB040A9C12 mov qword ptr [rcx],rax
00007FFB040A9C15 mov qword ptr [rcx+8],rax
00007FFB040A9C19 mov rax,qword ptr [r8]
00007FFB040A9C1C xor rax,qword ptr [rdx]
00007FFB040A9C1F mov qword ptr [rcx],rax
00007FFB040A9C22 mov rax,qword ptr [r8+8]
00007FFB040A9C26 xor rax,qword ptr [rdx+8]
00007FFB040A9C2A mov qword ptr [rcx+8],rax
00007FFB040A9C2E mov rax,rcx
Why doesn't it use the xmm registers and SIMD instructions for this? What's also odd is that SIMD instructions were generated for a version of this code that I hadn't explicitly vectorized, but they were never being executed, in favor of the regular registers and instructions.
I ensured that I was building with Release, x64, and "Optimize code" enabled. I saw similar behavior with x86 compilation. I'm somewhat of a novice at machine-level stuff, so it's possible there's just something going on here that I'm not properly understanding.
Framework version is 4.6, and Vector.IsHardwareAccelerated is false at runtime.
Update: "Compile with .NET Native tool chain" is the culprit. With it enabled, Vector.IsHardwareAccelerated is false; with it disabled, Vector.IsHardwareAccelerated is true. I've confirmed that when .NET Native is disabled, the compiler produces AVX instructions using the ymm registers. Which leads to the question... why is SIMD not enabled in .NET Native? And is there any way to change that?
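For reference, here is a minimal sketch of the kind of XOR loop in question, guarded by the same Vector.IsHardwareAccelerated flag. It is not my exact code: the class name XorSketch and the method Xor are hypothetical, and the byte-at-a-time fallback is only an assumption about what a reasonable scalar path looks like. It assumes all three arrays have the same length.

```csharp
using System;
using System.Numerics;

static class XorSketch
{
    // Hypothetical helper: writes a[i] ^ b[i] into dst[i] for every index.
    public static void Xor(byte[] a, byte[] b, byte[] dst)
    {
        if (a.Length != b.Length || a.Length != dst.Length)
            throw new ArgumentException("All arrays must have the same length.");

        int i = 0;

        // Only take the Vector<T> path when the runtime reports SIMD support;
        // otherwise Vector<T> operations are emulated with ordinary instructions.
        if (Vector.IsHardwareAccelerated)
        {
            int width = Vector<byte>.Count; // bytes per vector, decided at runtime
            for (; i <= a.Length - width; i += width)
            {
                var va = new Vector<byte>(a, i);
                var vb = new Vector<byte>(b, i);
                (va ^ vb).CopyTo(dst, i);
            }
        }

        // Scalar loop: handles the tail, or the whole array when SIMD is unavailable.
        for (; i < a.Length; i++)
            dst[i] = (byte)(a[i] ^ b[i]);
    }
}
```

Calling XorSketch.Xor(a, b, dst) gives the same result on either path; the flag only changes which loop does most of the work.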
Update Tangent: I discovered that the reason the auto-SSE-vectorized array code wasn't being executed was because the compiler had inserted an instruction that looked to see if the start of the array was at a lower address than one of the last elements of the array, and if it was, to just use the normal registers. I think that must be a bug in the compiler, because the start of an array should always be at a lower address than its last elements by convention. It was part of a set of instructions testing the memory addresses of each of the operand arrays, I think to make sure they were non-overlapping. I've filed a Microsoft Connect bug report for this: https://connect.microsoft.com/VisualStudio/feedback/details/1831117
true? – usr
why is SIMD not enabled in .NET Native?
I can only venture a guess: SIMD is handled by the JIT (the just-in-time compiler, which translates IL code into native code at runtime). .NET Native bypasses the JIT entirely by producing a purely native assembly ahead of time, with no translation needed at runtime. I guess they simply didn't implement SIMD support in the .NET Native tool chain, either because they haven't had time yet, or because .NET Native could be used to create programs running on CPUs that don't have SIMD registers – Kevin Gosse