9 votes

I'm trying to use System.Numerics.Vector<T> to vectorize an algorithm and take advantage of the CPU's SIMD operations. However, my vector implementation was substantially slower than my original implementation. Is there some trick to using Vectors that may not be documented? The specific use here is trying to speed up XORs of kilobytes of data.
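
Roughly, my vectorized version looks like the following (a simplified sketch with placeholder names; it assumes both buffers are the same length and a multiple of Vector<byte>.Count, and omits remainder handling):

    using System.Numerics;

    static class XorHelper
    {
        // XOR 'key' into 'data' one Vector<byte> chunk at a time.
        public static void XorBlocks(byte[] data, byte[] key)
        {
            int width = Vector<byte>.Count;
            for (int i = 0; i < data.Length; i += width)
            {
                // Load a chunk from each buffer, XOR them, and store the result back.
                var v = new Vector<byte>(data, i) ^ new Vector<byte>(key, i);
                v.CopyTo(data, i);
            }
        }
    }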

Unfortunately, almost all of the documentation I can find on it is based on a pre-release version of RyuJIT, and I don't know how much of that material applies to .NET Native.

When I inspect the disassembly during a Vector xor operation, it shows:

00007FFB040A9C10  xor         eax,eax  
00007FFB040A9C12  mov         qword ptr [rcx],rax  
00007FFB040A9C15  mov         qword ptr [rcx+8],rax  
00007FFB040A9C19  mov         rax,qword ptr [r8]  
00007FFB040A9C1C  xor         rax,qword ptr [rdx]  
00007FFB040A9C1F  mov         qword ptr [rcx],rax  
00007FFB040A9C22  mov         rax,qword ptr [r8+8]  
00007FFB040A9C26  xor         rax,qword ptr [rdx+8]  
00007FFB040A9C2A  mov         qword ptr [rcx+8],rax  
00007FFB040A9C2E  mov         rax,rcx  

Why doesn't it use the xmm registers and SIMD instructions for this? What's also odd is that SIMD instructions were generated for a version of this code that I hadn't explicitly vectorized, but they were never executed; the regular registers and instructions were used instead.

I ensured that I was running with Release, x64, and Optimize code enabled. I saw similar behavior with x86 compilation. I'm somewhat of a novice at machine-level stuff, so it's possible there's just something going on here that I'm not properly understanding.

Framework version is 4.6, Vector.IsHardwareAccelerated is false at runtime.

Update: "Compile with .NET Native tool chain" is the culprit. Enabling it causes Vector.IsHardwareAccelerated == false; Disabling it causes Vector.IsHardwareAccelerated == true. I've confirmed that when .NET Native is disabled, the compiler is producing AVX instructions using the ymm registers. Which leads to the question... why is SIMD not enabled in .NET Native? And is there any way to change that?

Update Tangent: I discovered that the reason the auto-SSE-vectorized array code wasn't being executed was that the compiler had inserted an instruction that checked whether the start of the array was at a lower address than one of its last elements, and if it was, fell back to the normal registers. I think that must be a compiler bug, because the start of an array should always be at a lower address than its last elements. It was part of a set of instructions testing the memory addresses of each of the operand arrays, I think to make sure they were non-overlapping. I've filed a Microsoft Connect bug report for this: https://connect.microsoft.com/VisualStudio/feedback/details/1831117

What Framework version is this? Is hardware acceleration reported to be true? – usr
Framework version 4.6, and IsHardwareAccelerated returns false. – Nick Bauer
"Why is SIMD not enabled in .NET Native?" I can only venture a guess: SIMD is handled by the JIT (the just-in-time compiler, the thing that transforms IL code into native code at runtime). .NET Native entirely bypasses the JIT by creating a purely native assembly (with no need for translation). I guess they simply didn't implement SIMD support in the .NET Native tool chain, either because they haven't had time yet, or because .NET Native could be used to create programs running on CPUs that don't have SIMD registers. – Kevin Gosse
KooKiz, that makes some sense, yet .NET Native does use the SSE registers and instructions to a degree, and I'm given to understand that there is a way to assess whether a CPU has the instructions or not. So I can only assume that it is due to the former. However, there was a blog post touting both .NET Native and SIMD in Universal Windows Apps at the same time, suggesting that both should be possible now: blogs.msdn.com/b/dotnet/archive/2015/07/30/… – Nick Bauer
Vector<> in System.Numerics.Vector.dll version 4.1.0.0 requires a processor that supports the AVX instruction set, at least Intel Sandy Bridge or AMD Bulldozer. That's a problem with an ahead-of-time compiler like .NET Native: there is no guarantee that the target machine has such a processor. – Hans Passant

1 Answer

9 votes

I contacted Microsoft via the contact address they posted for .NET Native questions and concerns: https://msdn.microsoft.com/en-us/vstudio/dotnetnative.aspx

My question was referred to Ian Bearman, Principal Software Engineering Manager in the Microsoft Code Generation and Optimization Technologies Team:

Currently .NET Native does not optimize the System.Numerics library and relies on the default library implementation. This may (read: will likely) result in code written using System.Numerics to not perform as well in .NET Native as it will against other CLR implementations.

While this is unfortunate, .NET Native does support auto-vectorization which comes with using the C++ optimizations mentioned above. The current shipping .NET Native compiler supports SSE2 ISA in its auto-vectorization on x86 and x64 and NEON ISA on ARM.

He also mentioned that they want to bring over the C++ compiler's ability to generate all vector instruction sets (AVX, SSE, etc.) and branch based on instruction-set detection at runtime.

He then suggested that if use of these instructions is really critical, the component can be built in C++, which has access to the compiler intrinsics (and presumably to this branching ability?), and then easily interfaced with the rest of the C# application.
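
For example (my sketch, not something Microsoft provided; NativeXor.dll and xor_blocks are hypothetical names), a native routine written with C++ intrinsics could be exposed to the C# side via P/Invoke:

    using System.Runtime.InteropServices;

    static class NativeXor
    {
        // Hypothetical native export, e.g. a C++ function built with AVX intrinsics:
        //   extern "C" __declspec(dllexport) void xor_blocks(uint8_t* data, const uint8_t* key, int length);
        [DllImport("NativeXor.dll", CallingConvention = CallingConvention.Cdecl)]
        public static extern void xor_blocks(byte[] data, byte[] key, int length);
    }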

As for the skipped-over SSE2 instructions, all I needed to do to get the right instructions generated was to replace a looped "a = a ^ b" with "a ^= b". Since the two should be equivalent expressions, this appears to be a bug, but fortunately one with a workaround.
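
For illustration, assuming a and b are byte arrays, the change was roughly along these lines:

    // Original form: the auto-vectorized path was skipped at runtime.
    for (int i = 0; i < a.Length; i++)
        a[i] = a[i] ^ b[i];

    // Rewritten form: the expected SSE2 instructions were generated and used.
    for (int i = 0; i < a.Length; i++)
        a[i] ^= b[i];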