UPDATE: The Span issues mentioned previously were fixed in the .NET Core 2.1 release (currently in preview). The fixes actually made the Span Vector version *faster* than the array Vector version...
NB: I'm testing this on an "Intel Xeon E5-1660 v4", which CPU-Z tells me has instructions for "MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, EM64T, VT-x, AES, AVX, AVX2, FMA3, RSX", so it should be OK...
Off the back of answering a Vector-based question, I thought I would try to implement some BLAS functions. I found that the ones that were reading/summing, such as dot product, were pretty good, but the ones where I was writing back to an array were bad: better than non-SIMD, but barely.
So am I doing something wrong, or is there more work in the JIT required?
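For reference, the kind of read/sum kernel that worked well is a dot product along these lines (a sketch only; "ddot" is my naming, it needs using System.Numerics;, and it assumes the same equal-length, non-null preconditions as everything below):

public static double ddot(double[] x, double[] y)
{
    var i = 0;
    var sum = 0.0;
    if (Vector.IsHardwareAccelerated)
    {
        var acc = Vector<double>.Zero;
        var length = x.Length + 1 - Vector<double>.Count;
        for (; i < length; i += Vector<double>.Count)
            acc += new Vector<double>(x, i) * new Vector<double>(y, i);
        // Horizontal sum of the accumulator lanes.
        sum = Vector.Dot(acc, Vector<double>.One);
    }
    // Scalar tail for any remaining elements.
    for (; i < x.Length; ++i)
        sum += x[i] * y[i];
    return sum;
}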
The write-back example (assuming x.Length == y.Length, not null, etc. blah, blah):
public static void daxpy(double alpha, double[] x, double[] y)
{
    for (var i = 0; i < x.Length; ++i)
        y[i] = y[i] + x[i] * alpha;
}
In Vector form this becomes:
public static void daxpy(double alpha, double[] x, double[] y)
{
    var i = 0;
    if (Vector.IsHardwareAccelerated)
    {
        // Process full Vector<double>-sized chunks.
        var length = x.Length + 1 - Vector<double>.Count;
        for (; i < length; i += Vector<double>.Count)
        {
            var valpha = new Vector<double>(alpha);
            var vx = new Vector<double>(x, i);
            var vy = new Vector<double>(y, i);
            (vy + vx * valpha).CopyTo(y, i);
        }
    }
    // Scalar loop for any remaining elements.
    for (; i < x.Length; ++i)
        y[i] = y[i] + x[i] * alpha;
}
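One variation worth noting, though I make no claim about its effect here: the broadcast of alpha can be built once outside the loop rather than per iteration, in case the JIT doesn't hoist it. A sketch, with "daxpy2" being my own naming:

public static void daxpy2(double alpha, double[] x, double[] y)
{
    var i = 0;
    if (Vector.IsHardwareAccelerated)
    {
        // Broadcast alpha once instead of constructing it on every iteration.
        var valpha = new Vector<double>(alpha);
        var length = x.Length + 1 - Vector<double>.Count;
        for (; i < length; i += Vector<double>.Count)
        {
            var vx = new Vector<double>(x, i);
            var vy = new Vector<double>(y, i);
            (vy + vx * valpha).CopyTo(y, i);
        }
    }
    for (; i < x.Length; ++i)
        y[i] = y[i] + x[i] * alpha;
}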
And, playing around in .NET Core 2.0, I thought I would try Span<T>, in both naive and Vector form:
public static void daxpy(double alpha, Span<double> x, Span<double> y)
{
    for (var i = 0; i < x.Length; ++i)
        y[i] += x[i] * alpha;
}
And the Vector form:
public static void daxpy(double alpha, Span<double> x, Span<double> y)
{
    if (Vector.IsHardwareAccelerated)
    {
        // Reinterpret the spans as spans of Vector<double>.
        var vx = x.NonPortableCast<double, Vector<double>>();
        var vy = y.NonPortableCast<double, Vector<double>>();
        var valpha = new Vector<double>(alpha);
        for (var i = 0; i < vx.Length; ++i)
            vy[i] += vx[i] * valpha;
        // Skip the elements handled above; the scalar loop below picks up the tail.
        x = x.Slice(Vector<double>.Count * vx.Length);
        y = y.Slice(Vector<double>.Count * vy.Length);
    }
    for (var i = 0; i < x.Length; ++i)
        y[i] += x[i] * alpha;
}
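As an aside, tied to the .NET Core 2.1 update at the top: NonPortableCast is a preview API, and as far as I can tell the same reinterpretation is done on newer builds with MemoryMarshal.Cast. A rough, untimed sketch of that variant ("daxpy3" is my naming; it needs using System.Runtime.InteropServices; on top of System.Numerics):

public static void daxpy3(double alpha, Span<double> x, Span<double> y)
{
    if (Vector.IsHardwareAccelerated)
    {
        // Reinterpret the spans as spans of Vector<double>.
        var vx = MemoryMarshal.Cast<double, Vector<double>>(x);
        var vy = MemoryMarshal.Cast<double, Vector<double>>(y);
        var valpha = new Vector<double>(alpha);
        for (var i = 0; i < vx.Length; ++i)
            vy[i] += vx[i] * valpha;
        // Skip what was handled above; the scalar loop below picks up the tail.
        x = x.Slice(Vector<double>.Count * vx.Length);
        y = y.Slice(Vector<double>.Count * vy.Length);
    }
    for (var i = 0; i < x.Length; ++i)
        y[i] += x[i] * alpha;
}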
So the relative timings for the four versions above are:
Naive        1.0
Vector       0.8
Span Naive   2.5  ==> Update: 1.1
Span Vector  0.9  ==> Update: 0.6
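The numbers are relative wall-clock times. For anyone reproducing them, roughly this kind of scaffold is enough (a sketch only, not the exact harness behind the figures above; DaxpyNaive / DaxpyVector stand in for the implementations earlier, the sizes and repeat counts are arbitrary, and BenchmarkDotNet would be more rigorous):

// Needs: using System; using System.Diagnostics;
static double TimeMs(Action body, int reps = 500)
{
    body();                              // warm-up so the JIT has compiled everything
    var sw = Stopwatch.StartNew();
    for (var r = 0; r < reps; ++r)
        body();
    return sw.Elapsed.TotalMilliseconds;
}

static void Main()
{
    const int n = 1 << 20;
    var rng = new Random(42);
    var x = new double[n];
    var y = new double[n];
    for (var i = 0; i < n; ++i) { x[i] = rng.NextDouble(); y[i] = rng.NextDouble(); }

    var naive  = TimeMs(() => DaxpyNaive(2.0, x, y));
    var vector = TimeMs(() => DaxpyVector(2.0, x, y));
    Console.WriteLine($"Naive: {naive:F1} ms, Vector: {vector:F1} ms, ratio {vector / naive:F2}");
}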
So am I doing something wrong? I could hardly think of a simpler example, so I don't think so?
Comments:
- Tigran: "IL generated by vectorized version?"
- Dai: "System.Buffer.BlockCopy and then the *alpha step."