I'm wondering whether there is any way to improve the performance of this compacting. The idea is to saturate values above 4095 and pack each value into 12 bits in a new contiguous buffer, like this:
Concept:
Convert:
Input buffer: [0.0][0.1][0.2] ... [0.15] | [1.0][1.1][1.2] ... [1.15] | [2.0][2.1][2.2] ... [2.15] etc ...
to:
Output buffer: [0.0][0.1][0.2] ... [0.11] | [1.0][1.1][1.2] ... [1.11] | [2.0][2.1][2.2] ... [2.11] etc ...
The input and output buffers are defined as:
uint16_t input[76800] (its size is 153600 bytes)
uint24_t output[38400] (its size is 115200 bytes)
So I have reduced the data size by 1/4. This computation costs ~1 ms on a Cortex-A9 running at 792 MHz with 2 cores. I have to perform such "compression" because I transfer about 18 MB/s over Ethernet, which gives me a huge overhead. I've tested various compression algorithms such as Snappy and LZ4, and none of them came even close to the ~1 ms achieved with saturation and bit shifting.
I've written the following code:
#pragma pack(push, 1)
typedef struct {
    union {
        struct {
            uint32_t value0_24x1:24;
        };
        struct {
            uint32_t value0_12x1:12;
            uint32_t value1_12x1:12;
        };
        struct {
            uint32_t value0_8x1:8;
            uint32_t value1_8x1:8;
            uint32_t value2_8x1:8;
        };
    };
} uint24_t;
#pragma pack(pop)
static inline uint32_t __attribute__((always_inline)) saturate(uint32_t value)
{
    uint32_t result;
    /* USAT: unsigned saturate to 12 bits. The operands already have symbolic
       names, so use them; "volatile" is unnecessary because the asm has no
       side effects beyond its output. */
    asm("usat %[result], %[saturate], %[value]"
        : [result] "=r" (result)
        : [value] "r" (value), [saturate] "I" (12));
    return result;
}
void __attribute__((noinline, used)) compact(const uint16_t *input, uint24_t *output, uint32_t elements)
{
#if 0
    /* More readable, but slower */
    for (uint32_t i = 0; i < elements; ++i) {
        output->value0_12x1 = saturate(*input++);
        (output++)->value1_12x1 = saturate(*input++);
    }
#else
    /* Alternative - less readable but faster */
    for (uint32_t i = 0; i < elements; ++i, input += 2)
        (output++)->value0_24x1 = saturate(*input) | ((uint32_t)saturate(*(input + 1))) << 12;
#endif
}
static uint16_t buffer_in[76800] = {0};
static uint24_t buffer_out[38400] = {0};

int main()
{
    /* Dividing by 2 because compact() consumes two input values per iteration */
    compact(buffer_in, buffer_out, sizeof(buffer_in) / sizeof(buffer_in[0]) / 2);
    return 0;
}
And its assembly:
00008664 <compact>:
    8664: e92d4010 push {r4, lr}
    8668: e3a03000 mov r3, #0
    866c: ea00000c b 86a4 <compact+0x40>
    8670: e1d040b0 ldrh r4, [r0]
    8674: e6ec4014 usat r4, #12, r4
    8678: e1d0c0b2 ldrh ip, [r0, #2]
    867c: e6ecc01c usat ip, #12, ip
    8680: e184c60c orr ip, r4, ip, lsl #12
    8684: e2833001 add r3, r3, #1
    8688: e2800004 add r0, r0, #4
    868c: e5c1c000 strb ip, [r1]
    8690: e7e7445c ubfx r4, ip, #8, #8
    8694: e7e7c85c ubfx ip, ip, #16, #8
    8698: e5c14001 strb r4, [r1, #1]
    869c: e5c1c002 strb ip, [r1, #2]
    86a0: e2811003 add r1, r1, #3
    86a4: e1530002 cmp r3, r2
    86a8: 1afffff0 bne 8670 <compact+0xc>
    86ac: e8bd8010 pop {r4, pc}
Compiled using GCC 4.6.3 with the following CFLAGS:
-Os (-O2 and -O3 do not give any noticeable improvement)
-march=armv7-a -mcpu=cortex-a9 -mtune=cortex-a9
-marm -mfloat-abi=softfp -mfpu=neon -funsafe-math-optimizations
The benchmark has shown that we use ~10.3 cycles per data conversion.
The questions are:
- Can I use NEON to improve the performance?
- Can someone give me some hints regarding NEON? Which intrinsics should I use?
A code example would be very welcome, because I'm a complete noob when it comes to NEON.
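Update: for anyone else looking at this, here is a rough sketch of what I imagine the NEON version could look like. It is only a sketch under assumptions I haven't verified on the target: it uses vld2 to de-interleave even/odd samples, a lane-wise vmin against 4095 in place of usat, and scalar 3-byte stores per lane (a vtbl byte shuffle could replace those and would likely be faster). The function name compact_neon is mine; a scalar path handles the tail and serves as a portable fallback on non-NEON builds.

```c
#include <stdint.h>
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Pack pairs of 16-bit samples, saturated to 12 bits, into 3-byte groups.
 * 'pairs' is the number of 24-bit output groups (i.e. input elements / 2). */
void compact_neon(const uint16_t *in, uint8_t *out, size_t pairs)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const uint16x4_t vmax = vdup_n_u16(4095);
    /* Process 4 pairs (8 inputs -> 12 output bytes) per iteration. */
    for (; i + 4 <= pairs; i += 4) {
        /* De-interleave while loading: val[0] = even samples, val[1] = odd. */
        uint16x4x2_t v = vld2_u16(in);
        in += 8;
        uint16x4_t even = vmin_u16(v.val[0], vmax);  /* saturate to 4095 */
        uint16x4_t odd  = vmin_u16(v.val[1], vmax);
        /* Widen to 32 bits and merge: even | (odd << 12). */
        uint32x4_t packed = vorrq_u32(vmovl_u16(even),
                                      vshlq_n_u32(vmovl_u16(odd), 12));
        /* Each lane holds 24 significant bits; store 3 bytes per lane.
         * (Scalar stores for clarity; a vtbl shuffle would avoid them.) */
        uint32_t lane[4];
        vst1q_u32(lane, packed);
        for (int k = 0; k < 4; ++k) {
            out[0] = (uint8_t)lane[k];
            out[1] = (uint8_t)(lane[k] >> 8);
            out[2] = (uint8_t)(lane[k] >> 16);
            out += 3;
        }
    }
#endif
    /* Scalar tail, also the portable fallback when NEON is unavailable. */
    for (; i < pairs; ++i, in += 2) {
        uint32_t a = in[0] > 4095 ? 4095 : in[0];
        uint32_t b = in[1] > 4095 ? 4095 : in[1];
        uint32_t p = a | (b << 12);
        out[0] = (uint8_t)p;
        out[1] = (uint8_t)(p >> 8);
        out[2] = (uint8_t)(p >> 16);
        out += 3;
    }
}
```

Whether this actually beats the 10.3 cycles per conversion of the usat loop would need measuring on the A9.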
Comments:

See memcpy() and "Cortex-A8 fastest memory copy"; the concepts are the same whether using NEON or just usat. A pld can get data to the L1/L2 cache quicker, and a pldw may help with writes. Your cache line is 8 words (32 bits each), so processing that amount and ensuring it is aligned will do as much as the opcodes used. – artless noise

Use the symbolic names you declared: "usat %[result], %[saturate], %[value]". I would also remove the volatile, since it could (under certain circumstances) force this code to be executed more often than necessary. – David Wohlferd

I've benchmarked memcpy() and the results were better with the stock one provided by eglibc 2.14 (afair) than with all of the implementations proposed by infocenter.arm for the Cortex-A8. I don't remember the details, but the one I have uses just pld, ldmia and stmia in general (and I was told that this is the best choice for Cortex-A9). If I'm wrong I'll correct myself on Monday when I'm back at home. Btw, you're right about using pld together with usat – I had better results. – Piotr Nowak