37
votes

What are the most noticeable differences between Google Protocol Buffers and ASN.1 (with PER encoding)? For my project, the most important issue is the size of the serialized data. Has anyone done any data-size comparisons between the two?

4
Perhaps a related question: why do we need Protocol Buffers when we already have a mature ASN.1? Not-invented-here syndrome at Google?

4 Answers

24
votes

If you use ASN.1 with Unaligned PER, and define your data types using the appropriate constraints (e.g., specifying lower/upper bounds for integers, upper bounds for the length of lists, etc.), your encodings will be very compact. There will be no bits wasted on alignment or padding between fields, and each field will be encoded in the minimum number of bits necessary to hold its permitted range of values. For example, a field of type INTEGER (1..8) will be encoded in 3 bits (1='000', 2='001', ..., 8='111'); and a CHOICE with four alternatives will occupy 2 bits (indicating the chosen alternative) plus the bits occupied by the chosen alternative.

ASN.1 has many other interesting features that have been used successfully in many published standards. An example is the extension marker ("..."), which, when applied to SEQUENCE, CHOICE, ENUMERATED, and other types, enables backward and forward compatibility between endpoints implementing different versions of the specification.
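
To make this concrete, here is a small hypothetical ASN.1 module (all module, type, and field names are invented for illustration) showing constrained types of the kind described above, plus the extension marker:

Example DEFINITIONS AUTOMATIC TAGS ::= BEGIN

    Small ::= INTEGER (1..8)        -- 8 permitted values: 3 bits in UPER

    Colour ::= CHOICE {             -- 4 alternatives: 2-bit choice index in UPER,
        red    NULL,                -- followed by the chosen alternative's bits
        green  NULL,
        blue   NULL,
        other  INTEGER (0..255)
    }

    Message ::= SEQUENCE {
        id     Small,
        colour Colour,
        ...                         -- extension marker: fields can be added in
    }                               -- later versions without breaking old decoders

END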

13
votes

It's been a long time since I've done any ASN.1 work, but the size is very likely to depend on the details of your types and your actual data.

I would strongly recommend that you prototype both and put some real data in to compare.

If your protocol buffer will contain repeated primitive types, you should look at the latest source in Subversion for Protocol Buffers: they can now be represented in a "packed" format, which is much more space-efficient. (My C# port caught up with this feature some time last week.)

6
votes

When the size of the packed/encoded message is important, you should also note that protobuf is not able to pack repeated fields that are not of a primitive numeric type; see the Protocol Buffers encoding documentation for more information.

This is an issue if, for example, you have messages of this type (the comments give the actual range of values):

message P{
    required sint32 x = 1; // -0x1ffff  to  0x20000
    required sint32 y = 2; // -0x1ffff  to  0x20000
    required sint32 z = 3; // -0x319c  to   0x3200
}
message Array{
    repeated P ps = 1;
    optional uint32 somemoredata = 2;
}

If the array has a length of, say, 32, the packed message size with protobuf would be approximately 250 to 450 bytes, depending on what values the array actually contains. This can even grow to over 1000 bytes if you use the full 32-bit range, or if you use int32 instead of sint32 and have negative values.

The raw data blob (assuming that z can be stored as an int16 value) would only consume 320 bytes (32 elements × 10 bytes each), so the ASN.1 message is always smaller than 320 bytes, since the maximum values actually require not 32 bits but 19 bits (x, y) and 15 bits (z).
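
For comparison, a possible ASN.1 definition of the same structure might look like the following (a sketch with invented names, assuming the array is bounded at 32 elements as in the example above); with Unaligned PER, each field is then encoded in only as many bits as its declared range requires, rather than as byte-aligned varints:

Points DEFINITIONS AUTOMATIC TAGS ::= BEGIN

    P ::= SEQUENCE {
        x INTEGER (-131071..131072),   -- -0x1ffff .. 0x20000
        y INTEGER (-131071..131072),   -- -0x1ffff .. 0x20000
        z INTEGER (-12700..12800)      -- -0x319c  .. 0x3200
    }

    Array ::= SEQUENCE {
        ps           SEQUENCE (SIZE (0..32)) OF P,
        somemoredata INTEGER (0..4294967295) OPTIONAL
    }

END

Under these assumptions, the whole 32-element array stays well below the 320-byte raw blob, consistent with the size advantage described above.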

The protobuf message size can be optimized with this message definition:

message Ps{
    repeated sint32 xs = 1 [packed=true];
    repeated sint32 ys = 2 [packed=true];
    repeated sint32 zs = 3 [packed=true];
}
message Array{
    required Ps ps = 1;
    optional uint32 somemoredata = 2;
}

which results in message sizes between approximately 100 bytes (all values zero), 300 bytes (all values at their range maximum), and 500 bytes (all values are large 32-bit values).

3
votes

Protocol Buffers does not guarantee preservation of the order of fields in the binary encoding, but ASN.1 does. This is not related to size, so it might not be the most noticeable difference in your use case, but it is an important one for comparison, for digital signatures, for simplified parsing, and possibly for other applications.