6
votes

What are the trade-offs, advantages and disadvantages of each of these streaming implementations where multiple messages of the same type are encoded?

Are they any different at all? What I want to achieve is to store a vector of boxes in a protobuf.

Impl 1 :

package foo;

message Boxes
{ 
  message Box 
  { required int32 w = 1;
    required int32 h = 2;
  }

  repeated Box boxes = 1; 
}

Impl 2:

package foo;

message Box 
{ required int32 w = 1;
  required int32 h = 2;
}

message Boxes 
{ repeated Box boxes = 1; 
}

Impl 3 : Stream multiple of these messages into the same file.

package foo;

message Box 
{ required int32 w = 1;
  required int32 h = 2;
}

2 Answers

11
votes

Marc Gravell's answer is certainly correct, but one point he missed is:

  • Options 1 and 2 (repeated field) serialise / deserialise all the boxes at once.
  • Option 3 (multiple messages in the file) serialises / deserialises box by box. If using Java, you can use delimited files (which add a varint length at the start of each message).

Most of the time it will not matter whether you use a repeated field or multiple messages, but if there are millions / billions of boxes, memory will be an issue for options 1 and 2 (repeated), and option 3 (multiple messages in the file) would be the best choice.
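To make the box-by-box point concrete, here is a minimal Python sketch of a length-delimited stream. It hand-encodes the wire format instead of using the protobuf library, so every function name in it (encode_varint, read_varint, encode_box, decode_box, write_delimited, read_delimited) is made up for illustration, and it assumes w and h are non-negative. The reader is a generator, so only one Box is held in memory at a time.

```python
import io

def encode_varint(n):
    """Encode a non-negative int as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def read_varint(stream):
    """Read one varint from a binary stream; return None at a clean EOF."""
    shift = result = 0
    while True:
        raw = stream.read(1)
        if not raw:
            if shift == 0:
                return None
            raise EOFError("truncated varint")
        result |= (raw[0] & 0x7F) << shift
        if not raw[0] & 0x80:
            return result
        shift += 7

def encode_box(w, h):
    # Tag byte = (field_number << 3) | wire_type; wire type 0 is varint.
    return b"\x08" + encode_varint(w) + b"\x10" + encode_varint(h)

def decode_box(payload):
    stream = io.BytesIO(payload)
    fields = {}
    while True:
        tag = read_varint(stream)
        if tag is None:
            break
        fields[tag >> 3] = read_varint(stream)  # Box only has varint fields
    return fields[1], fields[2]

def write_delimited(stream, boxes):
    """Write each box as: varint length, then the encoded Box."""
    for w, h in boxes:
        payload = encode_box(w, h)
        stream.write(encode_varint(len(payload)) + payload)

def read_delimited(stream):
    """Yield boxes one at a time without loading the whole stream."""
    while True:
        size = read_varint(stream)
        if size is None:
            return
        yield decode_box(stream.read(size))

buf = io.BytesIO()
write_delimited(buf, [(1, 2), (300, 4)])
buf.seek(0)
boxes = list(read_delimited(buf))
```

In real code you would let the generated classes do the encoding and only add the length prefix yourself (or use the delimited helpers where your language binding has them).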

So in summary:

  • If there are millions / billions of boxes, use option 3 (multiple messages in the file).
  • Otherwise use one of the repeated options (1 / 2), because it is simpler and supported across all Protocol Buffers versions.

Personally I would like to see a "standard" multiple-message format.

8
votes

1 & 2 only change where / how the types are declared. The work itself will be identical.

3 is more interesting: you can't just stream Box after Box after Box, because the root object in protobuf is not terminated (to allow concat === merge). If you only write Box messages back to back, then when you deserialize you will get exactly one Box holding the last w and h that were written. You need to add a length prefix; you could do that arbitrarily, but if you happen to "varint"-encode the length, you're close to what repeated gives you, except that repeated also includes a field header (field 1, wire type 2, so binary 1010 = decimal 10) before each "varint" length.
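Both claims can be checked at the byte level. The Python sketch below hand-encodes the wire format rather than using the protobuf library, so the helper names (encode_varint, varint_at, encode_box, parse_varint_fields) are hypothetical, and it assumes non-negative w and h. It shows that two concatenated Box payloads parse as a single Box with the last values, and that one element of the repeated field is the varint-delimited form plus a leading tag byte of 10.

```python
def encode_varint(n):
    """Encode a non-negative int as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def varint_at(data, pos):
    """Decode the varint starting at data[pos]; return (value, next_pos)."""
    shift = result = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def encode_box(w, h):
    # Tag byte = (field_number << 3) | wire_type; wire type 0 is varint.
    return b"\x08" + encode_varint(w) + b"\x10" + encode_varint(h)

def parse_varint_fields(data):
    """Parse a buffer of varint-only fields; a later value replaces an earlier one."""
    fields, pos = {}, 0
    while pos < len(data):
        tag, pos = varint_at(data, pos)
        value, pos = varint_at(data, pos)
        fields[tag >> 3] = value
    return fields

first = encode_box(1, 2)
second = encode_box(3, 4)

# No framing: the parser cannot see where one Box ends, so the two
# payloads merge and only the last w and h survive.
merged = parse_varint_fields(first + second)

# Framing used by "repeated Box boxes = 1": tag, varint length, payload.
tag = (1 << 3) | 2  # field 1, wire type 2 (length-delimited) = 10
repeated_entry = bytes([tag]) + encode_varint(len(first)) + first

# Framing used by a hand-rolled varint-delimited stream: length, payload.
delimited_entry = encode_varint(len(first)) + first
```

Here merged comes out as {1: 3, 2: 4}, and repeated_entry differs from delimited_entry only by its first byte, which is the per-element overhead of the repeated approach.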

If I were you, I'd just use the repeated for simplicity. Which of 1 / 2 you choose would depend on personal choice.