7 votes

I'm wondering if there is any way to improve the performance of this compacting. The idea is to saturate values higher than 4095 and place each value every 12 bits in a new contiguous buffer, just like this:

Concept:

Convert:

Input buffer: [0.0][0.1][0.2] ... [0.15] | [1.0][1.1][1.2] ... [1.15] | [2.0][2.1][2.2] ... [2.15] etc ...

to:

Output buffer: [0.0][0.1][0.2] ... [0.11] | [1.0][1.1][1.2] ... [1.11] | [2.0][2.1][2.2] ... [2.11] etc ...

The input and output buffers are defined as:

uint16_t input[76800] (its size is 153600 bytes)

uint24_t output[38400] (its size is 115200 bytes)

So I have reduced the data size by 1/4. This computation costs ~1 ms on a Cortex-A9 running at 792 MHz with 2 cores. I have to perform such "compression" because I transfer about 18 MB/s over Ethernet, which gives me huge overhead. I've tested various compression algorithms such as Snappy and LZ4, and none of them came even close to the ~1 ms achieved with saturation and bit shifting.

I've written the following code:

#pragma pack(push, 1)
typedef struct {
        union {
                struct {
                        uint32_t value0_24x1:24;
                };
                struct {
                        uint32_t value0_12x1:12;
                        uint32_t value1_12x1:12;
                };
                struct {
                        uint32_t value0_8x1:8;
                        uint32_t value1_8x1:8;
                        uint32_t value2_8x1:8;
                };
        };
} uint24_t;
#pragma pack(pop)


static inline uint32_t __attribute__((always_inline)) saturate(uint32_t value)
{
        register uint32_t result;

        asm volatile("usat %0, %2, %1 \n\t"                     \
                : [result] "=r" (result)                        \
                : [value] "r" (value), [saturate] "I" (12)      \
                :                                               \
                );

        return result;
}

void __attribute__((noinline, used)) compact(const uint16_t *input, uint24_t *output, uint32_t elements)
{
#if 0
        /* More readable, but slower */
        for (uint32_t i = 0; i < elements; ++i) {
                output->value0_12x1 = saturate(*input++);
                (output++)->value1_12x1 = saturate(*input++);
        }
#else
        /* Alternative - less readable but faster */
        for (uint32_t i = 0; i < elements; ++i, input += 2)
                (output++)->value0_24x1 = saturate(*input) | ((uint32_t)saturate(*(input+1))) << 12;
#endif
}

static uint16_t buffer_in[76800] = {0};
static uint24_t buffer_out[38400] = {0};

int main()
{
    /* Dividing by 2 because we process two input values in a single loop inside compact() */
    compact(buffer_in, buffer_out, sizeof(buffer_in) / sizeof(buffer_in[0]) / 2);

    return 0;
}

And its assembly:

00008664 <compact>:
    8664:   e92d4010    push    {r4, lr}
    8668:   e3a03000    mov r3, #0
    866c:   ea00000c    b   86a4 <compact+0x40>
    8670:   e1d040b0    ldrh    r4, [r0]
    8674:   e6ec4014    usat    r4, #12, r4
    8678:   e1d0c0b2    ldrh    ip, [r0, #2]
    867c:   e6ecc01c    usat    ip, #12, ip
    8680:   e184c60c    orr ip, r4, ip, lsl #12
    8684:   e2833001    add r3, r3, #1
    8688:   e2800004    add r0, r0, #4
    868c:   e5c1c000    strb    ip, [r1]
    8690:   e7e7445c    ubfx    r4, ip, #8, #8
    8694:   e7e7c85c    ubfx    ip, ip, #16, #8
    8698:   e5c14001    strb    r4, [r1, #1]
    869c:   e5c1c002    strb    ip, [r1, #2]
    86a0:   e2811003    add r1, r1, #3
    86a4:   e1530002    cmp r3, r2
    86a8:   1afffff0    bne 8670 <compact+0xc>
    86ac:   e8bd8010    pop {r4, pc}

Compiled using GCC 4.6.3 with the following CFLAGS:

-Os (-O2 and -O3 do not give any noticeable improvements)

-march=armv7-a -mcpu=cortex-a9 -mtune=cortex-a9

-marm -mfloat-abi=softfp -mfpu=neon -funsafe-math-optimizations

Benchmarking has shown we're using ~10.3 cycles per converted value (792,000 cycles in 1 ms divided by 76,800 input values ≈ 10.3).

The questions are:

  1. Can I use NEON to improve the performance?
  2. Can someone give me some hints regarding NEON? Which intrinsics should I use?

A code example would be very welcome, because I'm a complete noob when it comes to NEON.

You can use NEON to perform the saturation part - that should be quite simple. The 16->12 packing, though, is much trickier with SIMD - I've done it with SSE on x86, but I'm not sure whether NEON has the required capabilities. – Paul R
The memory bandwidth can be just as important. I would benchmark a memcpy(); see Cortex-A8 fastest memory copy; the concepts are the same whether using NEON or just usat. The pld can get data to the L1/L2 cache quicker. Also, a pldw may help with writes. Your cache line is 8 words (32 bits each), so processing this amount and ensuring it is aligned will do as much as the opcodes used. – artless noise
While it won't make the asm code any faster, it might be easier to read as "usat %[result], %[saturate], %[value]". I would also remove the volatile, since this could (under certain circumstances) force this code to be executed more than necessary. – David Wohlferd
@artless noise Some time ago I benchmarked memcpy() and the results were better with the stock one provided by eglibc 2.14 (afair) than with all of the implementations proposed by infocenter.arm for Cortex-A8. I do not remember the details, but the one I have uses just pld, ldmia and stmia in general (and I was told that this is the best choice for Cortex-A9). If I'm wrong I'll correct myself on Monday when I'm back at home. Btw you're right about using pld together with usat - I had better results. – Piotr Nowak
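
For reference, the pld idea from the comments can be tried from plain C with GCC's __builtin_prefetch. Below is only a sketch based on the faster loop in compact(); the look-ahead distance of 32 elements (64 bytes) is a guess that would need tuning on the target:

/* The faster loop from compact(), with an explicit read-prefetch hint.
 * The look-ahead distance (32 elements = 64 bytes) is an assumption to tune. */
for (uint32_t i = 0; i < elements; ++i, input += 2) {
        __builtin_prefetch(input + 32, 0);
        (output++)->value0_24x1 = saturate(*input)
                | ((uint32_t)saturate(*(input + 1))) << 12;
}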

3 Answers

5 votes

Here are the answers:

  1. Yes, it will be blazingly fast.

  2. You should avoid intrinsics at all costs. It isn't worth the effort. Go for assembly.

I'll give you a sample implementation once I arrive home.

////////////////////////////////////////////////////

OK, here it goes: you want to pack 16-bit values into 12 bits, a ratio of 4:3.

Therefore, it's wise to load the data 4-spread and store it 3-spread: vld4.16 -> vst3.16

/*
*   void fanic_pack16to12(unsigned short * pDst, unsigned short * pSrc, unsigned int count);
*   assert :
*       count >= 64
*       count % 4 == 0
*
*   written by : Jake Lee
*   part of FANIC project - Fastest ARM NEON Implementation Challenge
*/
    pDst .req r0
    pSrc .req r1
    count .req r2

    .text
    .arm
    .global fanic_pack16to12

    .func
    .align 5
fanic_pack16to12:
    pld     [pSrc]
    pld     [pSrc, #64]
    pld     [pSrc, #128]
    pld     [pSrc, #192]
    pld     [pSrc, #256]
    sub     count, count, #64

    .align 5
1:
    vld4.16     {d16, d18, d20, d22}, [pSrc]!
    vld4.16     {d17, d19, d21, d23}, [pSrc]!
    vld4.16     {d24, d26, d28, d30}, [pSrc]!
    vld4.16     {d25, d27, d29, d31}, [pSrc]!
    pld     [pSrc, #128]
    pld     [pSrc, #192]
    subs    count, count, #64

    vqshl.u16   q0, q8, #4
    vqshl.u16   q3, q9, #4
    vqshl.u16   q8, q10, #4
    vqshl.u16   q9, q11, #4
        vqshl.u16   q10, q12, #4
        vqshl.u16   q13, q13, #4
        vqshl.u16   q14, q14, #4
        vqshl.u16   q15, q15, #4
    vshl.u16    q1, q3, #4
    vshl.u16    q2, q8, #8
        vshl.u16    q11, q13, #4
        vshl.u16    q12, q14, #8
    vsri.16     q0, q3, #12
    vsri.16     q1, q8, #8
    vsri.16     q2, q9, #4
        vsri.16     q10, q13, #12
        vsri.16     q11, q14, #8
        vsri.16     q12, q15, #4

    vst3.16     {d0, d2, d4}, [pDst]!
    vst3.16     {d1, d3, d5}, [pDst]!
    vst3.16     {d20, d22, d24}, [pDst]!
    vst3.16     {d21, d23, d25}, [pDst]!
    bpl     1b

    cmp     count, #-64
    add     pDst, pDst, count

    bxle    lr

    add     pSrc, pSrc, count, lsl #1
    add     pDst, pDst, count, asr #1
    b       1b
     .endfunc
     .end

Please note how many cycles and how much bandwidth are saved thanks to smart register allocation and loop control - practices that are simply impossible with intrinsics.

This implementation will run about as fast as if it were done by dedicated hardware.

  • There is absolutely no pipeline hazard.
  • Roughly 50 cycles per iteration (64 values) = less than 1 cycle per value

Have fun!

//////////////////////////////////////////////////////

OK, below is the unpacking function:

/*
*   void fanic_unpack12to16(unsigned short *pDst, unsigned short *pSrc, unsigned int count);
*   assert :
*       count >=64
*       count % 4 == 0
*   
*   written by : Jake Lee
*   part of FANIC project - Fastest ARM NEON Implementation Challenge
*/
    pDst .req r0
    pSrc .req r1
    count .req r2

    .text
    .arm
    .global fanic_unpack12to16

    .func
    .align 5
fanic_unpack12to16:

    pld [pSrc]
    pld [pSrc, #64*1]
    pld [pSrc, #64*2]
    vpush       {q4}
    pld [pSrc, #64*3]
    vmov.i16    q4, #0x0fff
    pld [pSrc, #64*4]
    sub count, count, #64

    .align 5
1:
    vld3.16     {d20, d22, d24}, [pSrc]!
    vld3.16     {d21, d23, d25}, [pSrc]!
    vld3.16     {d26, d28, d30}, [pSrc]!
    vld3.16     {d27, d29, d31}, [pSrc]!
    pld     [pSrc, #128]
    pld     [pSrc, #192]
    subs    count, count, #64

    vshr.u16    q1, q11, #8
    vshr.u16    q2, q12, #12
    vshr.u16    q0, q10, #4
    vand        q3, q12, q4
        vshr.u16    q9, q14, #8
    vsli.16     q1, q10, #8
    vsli.16     q2, q11, #4
        vshr.u16    q10, q15, #12
        vsli.16     q9, q13, #8
    vbic.i16    q1, q1, #0xf000
    vbic.i16    q2, q2, #0xf000
    vsli.16     q10, q14, #4
    vshr.u16    q8, q13, #4
    vbic.i16    q9, q9, #0xf000
    vand        q11, q15, q4
    vbic.i16    q10, q10, #0xf000

    vst4.16     {d0, d2, d4, d6}, [pDst]!
    vst4.16     {d1, d3, d5, d7}, [pDst]!
    vst4.16     {d16, d18, d20, d22}, [pDst]!
    vst4.16     {d17, d19, d21, d23}, [pDst]!
    bpl     1b

    cmp     count, #-64
    add     pSrc, pSrc, count

    vpople      {q4}
    bxle    lr

    add     pSrc, pSrc, count, asr #1
    add     pDst, pDst, count, lsl #1
    b       1b

    .endfunc
    .end

Tweak points:

  • force-align both src and dst to 64 bytes for maximum bandwidth efficiency
  • then guarantee the alignment on all the memory instructions: 256-bit for the 4-spread ones, 64-bit for the 3-spread ones, like the following:

    vld4.16 {d16, d18, d20, d22}, [pSrc,:256]!

    ..

    vst3.16 {d0, d2, d4}, [pDst,:64]!

    ..

  • make count a multiple of 64. Otherwise, you'll have to write extra code to deal with the residual data (the current code would crash due to an alignment fault); a padding wrapper like the sketch at the end of this answer is one way around this

  • you may increase/decrease the pld offsets by 64 for possibly increased cache hit rate

This will improve the performance by a good margin, if not a huge one.
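
To satisfy the multiple-of-64 and alignment preconditions without touching the assembly, one option is a small C wrapper that pads the input into an aligned scratch buffer. This is only a sketch; the name pack16to12_padded and the use of posix_memalign are my assumptions, not part of the routine above:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

void fanic_pack16to12(unsigned short *pDst, unsigned short *pSrc, unsigned int count);

/* Hypothetical wrapper: rounds count up to the next multiple of 64 and copies
 * the input into a 64-byte aligned scratch buffer so the preconditions hold.
 * The caller's dst must itself be 64-byte aligned and large enough for the
 * padded result (padded * 3 / 2 bytes). */
static int pack16to12_padded(uint16_t *dst, const uint16_t *src, unsigned int count)
{
        unsigned int padded = (count + 63u) & ~63u;     /* next multiple of 64 */
        uint16_t *tmp;

        if (posix_memalign((void **)&tmp, 64, padded * sizeof(uint16_t)) != 0)
                return -1;

        memcpy(tmp, src, count * sizeof(uint16_t));
        memset(tmp + count, 0, (padded - count) * sizeof(uint16_t));

        fanic_pack16to12((unsigned short *)dst, (unsigned short *)tmp, padded);

        free(tmp);
        return 0;
}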

1 vote

Recently I wrote code for packing 16-bit data into 10 bits using SSE. Here is the code. I don't have a NEON machine at hand, so I can't rewrite the SSE code to NEON right now.

I used the following sources:

Hints for rewriting the code are as follows:

  • First of all, write a function for dumping NEON variables and use it for debugging (a minimal sketch of such a helper is given after these hints)

  • Use the NEON way to load and store values:

int16x8_t s;
s = vld1q_s16(ptr);
vst1q_s16(dst, s);   /* note: the destination pointer is the first argument */
  • You can cast (reinterpret) an int16x8_t as a uint32x4_t, e.g. with vreinterpretq_u32_s16.

  • Saturation:

int16x8_t s0, s1;
const int16x8_t shft0 = { 4, 4, 4, 4, 4, 4, 4, 4 };
const int16x8_t shft1 = { -4, -4, -4, -4, -4, -4, -4, -4 };
s0 = vrshlq_s16(s, shft0);
s1 = vrshlq_s16(s, shft1);
  • Shifts:
uint32x4_t vrshlq_u32 (uint32x4_t, int32x4_t)  // _mm_srli_epi32
uint64x1_t vrshl_u64 (uint64x1_t, int64x1_t)   // _mm_srli_epi64
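
As an illustration of the first hint, a sketch of such a dump helper (the name dump_u16x8 is made up) could look like this; it just spills the register to a small array with vst1q_u16 and prints the lanes:

#include <arm_neon.h>
#include <stdio.h>

/* Hypothetical debug helper: print the eight 16-bit lanes of a NEON register. */
static void dump_u16x8(const char *name, uint16x8_t v)
{
        uint16_t lanes[8];

        vst1q_u16(lanes, v);    /* spill the register to memory */
        printf("%s:", name);
        for (int i = 0; i < 8; ++i)
                printf(" %04x", (unsigned)lanes[i]);
        printf("\n");
}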
0 votes

The assembly looks tight enough; however, you can see that you are using 16-bit loads (ldrh) and byte stores (strb). The native word size of your ARM is 32 bits, so the real issue is probably the memory traffic for input and output.

You should refactor your code to do 32-bit loads and stores; it should get much faster.
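
A minimal scalar sketch of that idea, assuming a little-endian target, a value count that is a multiple of 8, 4-byte aligned buffers, and the same 12-bit field layout as the question's bitfield struct (the names compact32 and sat12 are made up):

#include <stdint.h>

/* Portable stand-in for the USAT #12 instruction. */
static inline uint32_t sat12(uint32_t v)
{
        return v > 0x0FFF ? 0x0FFF : v;
}

/* Hypothetical variant of compact() that does only 32-bit loads and stores:
 * each iteration turns 8 input values (4 word loads) into 12 output bytes
 * (3 word stores). 'elements' is the number of 16-bit input values. */
void compact32(const uint16_t *input, void *output, uint32_t elements)
{
        const uint32_t *in  = (const uint32_t *)input;
        uint32_t       *out = (uint32_t *)output;

        for (uint32_t i = 0; i < elements; i += 8) {
                /* Each 32-bit load brings in two 16-bit samples. */
                uint32_t w0 = *in++, w1 = *in++, w2 = *in++, w3 = *in++;

                uint32_t v0 = sat12(w0 & 0xFFFF), v1 = sat12(w0 >> 16);
                uint32_t v2 = sat12(w1 & 0xFFFF), v3 = sat12(w1 >> 16);
                uint32_t v4 = sat12(w2 & 0xFFFF), v5 = sat12(w2 >> 16);
                uint32_t v6 = sat12(w3 & 0xFFFF), v7 = sat12(w3 >> 16);

                /* 8 x 12 bits = 96 bits = three 32-bit words. */
                *out++ = v0 | (v1 << 12) | ((v2 & 0xFF) << 24);
                *out++ = (v2 >> 8) | (v3 << 4) | (v4 << 16) | ((v5 & 0x0F) << 28);
                *out++ = (v5 >> 4) | (v6 << 8) | (v7 << 20);
        }
}

With this layout every access to memory is a full 32-bit transfer, which is what this answer suggests.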