2
votes

In the description of scalar types for gpb proto2 (https://developers.google.com/protocol-buffers/docs/proto#scalar) it says:

  • int32

    Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead.

  • sint32

    Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s.

Will the sint32 be equally efficient for positive values as the int32?

In other words, is there any reason to use int32?

If it matters what language is used, i'm only interested in C++.

2
Do you need variable length integers? If not, fixed length might be better.tadman
Yes, variable length is desired.Martin G

2 Answers

3
votes

https://developers.google.com/protocol-buffers/docs/encoding#signed-integers

Signed varints are encoded by alternating between positive and negative values. For example,

   value     int32    zigzag    sint32
          (binary)            (binary)
       0  00000000         0  00000000
      -1  11111111         1  00000001
          11111111
          11111111
          11111111
          00001111
       1  00000001         2  00000010
      -2  11111110         3  00000011
          11111111
          11111111
          11111111
          00001111
...
      63  00111111       126  01111110
     -64  11000000       127  01111111
          11111111
          11111111
          11111111
          00001111
      64  01000000       128  10000000
                              00000001
...

On average, a positive number will require one more bit to be encoded as an sint than as an int.

(Expand and run the following snippet for a live demo.)

function encode_varint(number) {
  if (!number) return [0];
  var bytes = [];
  while (number) {
    var byte = number & 0x7F;
    number >>>= 7;
    if (number) byte |= 0x80;
    bytes.push(byte);
  }
  return bytes;
}
function format_bytes(bytes) {
  var output = '';
  for (var i = 0; i < bytes.length; i++) {
    if (i) output += ' ';
    output += bytes[i].toString(2).padStart(8, '0');
  }
  return output;
}
var valueElem = document.getElementById('value');
var int32Elem = document.getElementById('int32');
var sint32Elem = document.getElementById('sint32');
function update() {
  var value = parseInt(valueElem.value);
  var int32 = encode_varint(value);
  var sint32 = encode_varint(value << 1 ^ -(value < 0));
  int32Elem.value = format_bytes(int32);
  sint32Elem.value = format_bytes(sint32);
}
valueElem.addEventListener('change', update);
update();
#varint {
  display: grid;
  grid-template-columns: max-content auto;
  grid-row-gap: 1ex;
  grid-column-gap: 1ch;
}
#varint label {
  text-align: right;
}
#varint input {
  font-family: monospace;
}
<form id='varint' onsubmit='return false'>
  <label for='value'>value</label>
  <input id='value' type='number' value='0'>
  <label for='int32'>int32</label>
  <input id='int32' type='text' readonly>
  <label for='sint32'>sint32</label>
  <input id='sint32' type='text' readonly>
</form>
2
votes

Some positive values will be smaller as int32 than as sint32, but only by one byte at most.

Bascially, protobuf uses a "base128" encoding for numbers, where each byte of the encoded value has 7 bits of value data and one bit as the 'end' marker to find the end of the encode value. "int32" treats the number as a 32 bit twos-complement and encodes that as an unsigned 32-bit number, so negative values will be encoded as large positive values and always require 5 bytes. "sint32" on the other hand encodes as a weird sign-magnitude style encoding with the sign bit transposed to the least significant bit (why they didn't just use normal 2s complement is a mystery -- would be just as compact, but simpler to decode/encode).

The upshot is that a single byte int32 can represent numbers in the range 0-127, while a single byte sint32 represents numbers in the range -64..63. So value 64..127 will require 2 bytes as sint32 and only one as int32