
Unsigned integers can be compressed by using "bit-packing" techniques: Within a block of unsigned integers only the significant bits are stored, resulting in data compression when all integers in a block are "small". The method is known as FOR (frame of reference).

There are SIMD libraries that do this very efficiently.
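
A minimal scalar sketch of the bit-packing step, just to make the idea concrete (illustration only; real FOR codecs work on fixed-size blocks, subtract a per-block reference value first, and use SIMD throughout):

#include <stdint.h>
#include <stddef.h>

// Pack n values into 'bits' bits each (LSB-first), assuming every value fits,
// i.e. in[i] < (1u << bits). Hypothetical helper, not from any library.
void for_pack(const uint32_t *in, size_t n, unsigned bits, uint8_t *out) {
    uint64_t acc = 0;     // bit accumulator
    unsigned filled = 0;  // number of valid bits in acc
    for (size_t i = 0; i < n; i++) {
        acc |= (uint64_t)in[i] << filled;
        filled += bits;
        while (filled >= 8) { *out++ = (uint8_t)acc; acc >>= 8; filled -= 8; }
    }
    if (filled) *out = (uint8_t)acc;  // flush the last partial byte
}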

Now I want to use FOR-like techniques to encode signed integers, e.g. from a differenced sequence of unsorted unsigned integers. The sign of each signed integer needs to be stored somewhere; there are two options:

  1. Store the signs in a separate block of data. This adds overhead.
  2. Store the sign together with the absolute value of each signed integer.

I'm following path 2 right now. Two's complement has the sign bit in the msb (most significant bit), so that won't work for bit-packing à la FOR. One possibility is to store the sign in the lsb (least significant bit). Storing signed integers this way is very unusual, and there are no instructions that support it, as far as I know. The question now is: can these lsb-signed integers be encoded/decoded efficiently using SIMD instructions?
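
For clarity, a scalar sketch of this lsb-sign representation (illustration only, not an established format; like plain sign-magnitude it cannot represent INT32_MIN):

#include <stdint.h>

static inline uint32_t lsbsign_encode(int32_t x) {
    uint32_t mag = (uint32_t)(x < 0 ? -(int64_t)x : x);  // |x|
    return (mag << 1) | (uint32_t)(x < 0);               // sign goes into the lsb
}

static inline int32_t lsbsign_decode(uint32_t u) {
    int32_t mag = (int32_t)(u >> 1);
    return (u & 1) ? -mag : mag;
}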

I think AVX-512 _mm_testn_epi32_mask can be used to extract the lsb from each uint32, followed by a shift, then two mask_extract of some sort? Quite convoluted.

A common and reversible mapping from signed to unsigned integers is: (x < 0) ? (-2 * x - 1) : (2 * x). Could you make use of that? - njuffa
AVX-512 has SIMD rotate of 32-bit elements, if that helps. (felixcloutier.com/x86/vprold:vprolvd:vprolq:vprolvq). So you could bring the sign-bit to the bottom. But you'd then I guess want to mask away all the other leading ones for a negative number. Perhaps you can bit-flip or negate for negative inputs, so end up with leading zeros, and can use the low bit as a signal to negate or not when decoding. (0-x after a rotate has the advantage of leaving the low bit set for negative numbers, unlike x ^ -1.) I think this is the same mapping @njuffa suggested. - Peter Cordes
Yeah, so encode with vprold zmm, zmm, 1 / vpabsd zmm, zmm (absolute value of 32-bit elements). Decode with ... I'll post an answer when I think of something, if nobody beats me to it :P - Peter Cordes
A related question was asked 10 years ago: stackoverflow.com/questions/3557599/… - mcmayer
To implement @njuffa's suggestion, note that -x - 1 is the same as ~x, so (x < 0) ? (-2 * x - 1) : (2 * x) is the same as (x>>31) ^ (x+x). - chtz
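
A scalar reference version of the mapping discussed in these comments, for clarity (a sketch; it assumes arithmetic right shift of negative ints, which all mainstream compilers provide):

#include <stdint.h>

// ZigZag: 0, -1, 1, -2, 2, ...  ->  0, 1, 2, 3, 4, ...
static inline uint32_t zigzag_ref_encode32(int32_t x) {
    return ((uint32_t)x << 1) ^ (uint32_t)(x >> 31);  // (x >> 31) ^ (x + x)
}

static inline int32_t zigzag_ref_decode32(uint32_t u) {
    return (int32_t)((u >> 1) ^ (0u - (u & 1)));      // (u >> 1) ^ -(u & 1)
}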

1 Answer


Untested ZigZag examples in C using SSE2 for 64-bit integers:

(note: SSE2 is missing some 64-bit instructions, e.g. 64-bit arithmetic shifts and compares, hence the workarounds below...)

#include <emmintrin.h>   // SSE2
#include <tmmintrin.h>   // SSSE3, for _mm_alignr_epi8 used in delta_encode_epi32


// 64-bit ZigZag encode: (x << 1) ^ (x >> 63)
// (from a comment by Peter Cordes)
__m128i zigzag_encode_epi64(__m128i v) {
    // SSE2 has no 64-bit arithmetic shift: copy the high dword of each qword
    // into both halves, then arithmetic-shift those copies to get the sign mask.
    __m128i signmask = _mm_shuffle_epi32(v, _MM_SHUFFLE(3,3,1,1));
    signmask = _mm_srai_epi32(signmask, 31);
    return _mm_xor_si128(_mm_add_epi64(v, v), signmask);
}


// 64-bit ZigZag decode: (v >> 1) ^ -(v & 1)
__m128i zigzag_decode_epi64 (__m128i v) {
    __m128i signmask = _mm_and_si128(_mm_set_epi32(0, 1, 0, 1), v);  // v & 1 in each qword
    signmask = _mm_sub_epi64(_mm_setzero_si128(), signmask);         // 0 - (v & 1): 0 or all-ones
    return _mm_xor_si128(_mm_srli_epi64(v, 1), signmask);
}


// decode without loading a constant: ((v >> 1) << 1) - v  ==  -(v & 1)
__m128i zigzag_decodev3_epi64 (__m128i v) {
    __m128i t = _mm_srli_epi64(v, 1);
    __m128i signmask = _mm_sub_epi64(_mm_slli_epi64(t, 1), v);  // -(v & 1)
    return _mm_xor_si128(t, signmask);
}

Zigzag is good for bitwise varints. However, a bytewise group-varint may wish to "sign extend from a variable bit-width".
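
A scalar sketch of that "sign extend from a variable bit-width" alternative, using the classic xor/subtract trick (illustration only):

#include <stdint.h>

// Sign-extend the low 'bits' bits of v (1 <= bits <= 32).
static inline int32_t sign_extend(uint32_t v, unsigned bits) {
    uint32_t m = 1u << (bits - 1);             // position of the sign bit
    if (bits < 32) v &= (1u << bits) - 1;      // keep only the low 'bits' bits
    return (int32_t)((v ^ m) - m);             // flip the sign bit, then subtract it back
}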


32-bit examples

I favored compares over arithmetic shifts. I assume that, when unrolled, the compares will have one cycle lower latency.

__m128i zigzag_encode_epi32 (__m128i v) {
    __m128i signmask = _mm_cmpgt_epi32(_mm_setzero_si128(), v);  // all-ones where v < 0
    return _mm_xor_si128(_mm_add_epi32(v, v), signmask);         // (x << 1) ^ (x >> 31)
}


__m128i zigzag_decode_epi32 (__m128i v) {
    const __m128i m = _mm_set1_epi32(1);
    __m128i signmask = _mm_cmpeq_epi32(_mm_and_si128(m, v), m);  // all-ones where lsb is set
    return _mm_xor_si128(_mm_srli_epi32(v, 1), signmask);        // (v >> 1) ^ -(v & 1)
}


// delta encode: each element minus the preceding element
// (prev supplies the element before v[0]; _mm_alignr_epi8 needs SSSE3)
__m128i delta_encode_epi32 (__m128i v, __m128i prev) {
    return _mm_sub_epi32(v, _mm_alignr_epi8(v, prev, 12));
}


// prefix sum (see many of answers around stackoverflow...)
__m128i delta_decode_epi32 (__m128i v, __m128i prev) {
    prev = _mm_shuffle_epi32(prev, _MM_SHUFFLE(3,3,3,3)); // [P P P P]
    v = _mm_add_epi32(v, _mm_slli_si128(v, 4)); // [A AB BC CD]
    prev = _mm_add_epi32(prev, v); // [PA PAB PBC PCD]
    v = _mm_slli_si128(v, 8); // [0 0 A AB]
    return _mm_add_epi32(prev, v); // [PA PAB PABC PABCD]
}


__m128i delta_zigzag_encode_epi32 (__m128i v, __m128i prev) {
    return zigzag_encode_epi32(delta_encode_epi32(v, prev));
}


__m128i delta_zigzag_decode_epi32 (__m128i v, __m128i prev) {
    return delta_decode_epi32(zigzag_decode_epi32(v), prev);
}
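
Hypothetical round-trip usage of the combined functions above (a sketch only; assumes the element count is a multiple of 4 and that the first vector is delta-coded against zero):

#include <stdint.h>
#include <stddef.h>

void delta_zigzag_encode_block(const int32_t *in, uint32_t *out, size_t n) {
    __m128i prev = _mm_setzero_si128();
    for (size_t i = 0; i < n; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(in + i));
        _mm_storeu_si128((__m128i *)(out + i), delta_zigzag_encode_epi32(v, prev));
        prev = v;  // deltas are taken against the original previous values
    }
}

void delta_zigzag_decode_block(const uint32_t *in, int32_t *out, size_t n) {
    __m128i prev = _mm_setzero_si128();
    for (size_t i = 0; i < n; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(in + i));
        prev = delta_zigzag_decode_epi32(v, prev);  // only prev[3], the last decoded value, is reused
        _mm_storeu_si128((__m128i *)(out + i), prev);
    }
}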

Note: Delta coding would be faster (for the round-trip, and especially for decoding) if the elements were transposed while encoding and transposed back again during decoding, because horizontal prefix sums are really slow. However, determining the optimum number of elements to transpose in each batch seems like a hard problem.