I am processing audio buffers on Android; my setup is as follows:
1. get a system callback with a short buffer
2. convert the short buffer to a float buffer
3. do some DSP with the float buffer
4. convert the float buffer to a short buffer
5. deliver the short buffer to the system
I want to reduce the latency of steps 2 and 4, the short-to-float and float-to-short conversions (leaving aside the latency of step 3, the DSP, since I will take care of that later).
So, I would like to use NEON SIMD to calculate multiple values at a time.
What I currently have for steps 2 and 4 is the following code:
#define CONV16BIT 32768
#define CONVMYFLT (1./32768.)
static int i;
float * floatBuffer;
short * shortInBuffer;
short * shortOutBuffer;
...(malloc and init buffers method)
...(inside callback)
//2- short to float
for(i = 0; i < bufferSize; i++) {
    floatBuffer[i] = (float) (shortInBuffer[i] * CONVMYFLT);
}
...(do dsp)
//4- float to short
for(i = 0; i < bufferSize; i++) {
    shortOutBuffer[i] = (short) (floatBuffer[i] * CONV16BIT);
}
I believe that the steps I need for taking advantage of NEON are:
(for the short to float part)
- Load the 16-bit shorts from the short buffer
- Convert them to 32-bit integers
- Convert them to float
- Multiply them by CONVMYFLT
- Store them into float buffer
I found this info in this post (the selected answer):
__m128 factor = _mm_set1_ps(1.0f / value);
for (int i = 0; i < W*H; i += 8)
{
    // Load 8 16-bit ushorts.
    // vi = {a,b,c,d,e,f,g,h}
    __m128i vi = _mm_load_si128((const __m128i*)(source + i));
    // Convert to 32-bit integers
    // vi0 = {a,0,b,0,c,0,d,0}
    // vi1 = {e,0,f,0,g,0,h,0}
    __m128i vi0 = _mm_cvtepu16_epi32(vi);
    __m128i vi1 = _mm_cvtepu16_epi32(_mm_unpackhi_epi64(vi,vi));
    // Convert to float
    __m128 vf0 = _mm_cvtepi32_ps(vi0);
    __m128 vf1 = _mm_cvtepi32_ps(vi1);
    // Multiply
    vf0 = _mm_mul_ps(vf0,factor);
    vf1 = _mm_mul_ps(vf1,factor);
    // Store
    _mm_store_ps(destination + i + 0,vf0);
    _mm_store_ps(destination + i + 4,vf1);
}
However, this is SIMD for Intel SSE4.1, not for NEON.
What would be the equivalent implementation for NEON on Android? (I had a hard time understanding the NEON intrinsics.)
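For reference, here is a rough, untested sketch of how the SSE loop above might map to NEON intrinsics for the short-to-float direction. It assumes arm_neon.h is available, bufferSize is a multiple of 8, and it reuses the shortInBuffer / floatBuffer / CONVMYFLT names from the scalar code above; vget_low_s16 / vget_high_s16 split the 8-lane vector into halves before widening:

#include <arm_neon.h>

// sketch only: short -> float, assuming bufferSize % 8 == 0
for (int i = 0; i < bufferSize; i += 8) {
    // load 8 signed 16-bit samples
    int16x8_t v = vld1q_s16(&shortInBuffer[i]);
    // widen the low and high halves to 32-bit signed integers
    int32x4_t i32l = vmovl_s16(vget_low_s16(v));
    int32x4_t i32h = vmovl_s16(vget_high_s16(v));
    // convert to float and scale by 1/32768
    float32x4_t f32l = vmulq_n_f32(vcvtq_f32_s32(i32l), (float) CONVMYFLT);
    float32x4_t f32h = vmulq_n_f32(vcvtq_f32_s32(i32h), (float) CONVMYFLT);
    // store 8 floats
    vst1q_f32(&floatBuffer[i], f32l);
    vst1q_f32(&floatBuffer[i + 4], f32h);
}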
Update 1: From fsheikh's answer I was able to build this:
- I am able to get int16_t buffers from the system callback
- all my buffer sizes are multiples of 8
int16x8_t i16v;
int32x4_t i32vl, i32vh;
float32x4_t f32vl, f32vh;
for(i = 0; i < bufferSize; i += 8) {
    //load 8 16-bit lanes into the vector
    i16v = vld1q_s16(&int16_t_inBuffer[i]);
    //convert into 32-bit signed integers
    i32vl = vmovl_s16 (i16v);
    i32vh = vmovl_s16 (vzipq_s16(i16v, i16v).val[0]);
    //convert to 32-bit float
    f32vl = vcvtq_f32_s32(i32vl);
    f32vh = vcvtq_f32_s32(i32vh);
    //multiply by scalar
    f32vl = vmulq_n_f32(f32vl, CONVMYFLT);
    f32vh = vmulq_n_f32(f32vh, CONVMYFLT);
    //store in float buffer
    vst1q_f32(&floatBuffer[i], f32vl);
    vst1q_f32(&floatBuffer[i + 4], f32vh);
}
Should this work right? I have doubts over whether I should use the low or the high part of the interleaved vector returned by vzipq_s16:
i32vh = vmovl_s16 (vzipq_s16(i16v, i16v).val[0]); or
i32vh = vmovl_s16 (vzipq_s16(i16v, i16v).val[1]);
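For step 4 (float back to short), the reverse path would presumably be: load floats, scale by CONV16BIT, convert to 32-bit integers, narrow to 16 bits, and store. Below is a rough, untested sketch, reusing the floatBuffer / shortOutBuffer / CONV16BIT names from the scalar code and again assuming bufferSize is a multiple of 8; note that vqmovn_s32 saturates on overflow, unlike the plain (short) cast in the scalar loop:

// sketch only: float -> short, assuming bufferSize % 8 == 0
for (int i = 0; i < bufferSize; i += 8) {
    // load 8 floats
    float32x4_t f32l = vld1q_f32(&floatBuffer[i]);
    float32x4_t f32h = vld1q_f32(&floatBuffer[i + 4]);
    // scale back to the 16-bit range and truncate to 32-bit integers
    int32x4_t i32l = vcvtq_s32_f32(vmulq_n_f32(f32l, (float) CONV16BIT));
    int32x4_t i32h = vcvtq_s32_f32(vmulq_n_f32(f32h, (float) CONV16BIT));
    // narrow each half to 16 bits with saturation and recombine
    int16x8_t v = vcombine_s16(vqmovn_s32(i32l), vqmovn_s32(i32h));
    // store 8 shorts
    vst1q_s16(&shortOutBuffer[i], v);
}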