
I am new to Kafka and am trying to store messages with the least memory overhead, so I want to avoid an encoding that repeats field names (e.g., JSON). Consider a message with three variable-length String fields:

interface IMessage {
    String getA();
    String getB();
    String getC();
}

Since Kafka includes a default StringSerializer, the easiest way to encode would be to simply concatenate the fields with a delimiter. Something like:

String encoded = "FieldA|FieldB|FieldC";

Under the hood, Kafka will convert this to a byte array.

My question is: will Kafka use Java's default UTF-8 encoding, such that each ASCII character in my string takes up only one byte? In other words, will a 15-character string take up 15 bytes in Kafka's memory? Or is it more efficient for some reason to call getBytes() in Java and pass the byte array directly to the ByteArraySerializer?

byte[] encoded = "FieldA|FieldB|FieldC".getBytes(StandardCharsets.UTF_8);
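
For concreteness, this is the kind of size check I mean; a quick sketch (the string literal is just my example payload):

import java.nio.charset.StandardCharsets;

public class EncodingSizeCheck {
    public static void main(String[] args) {
        String encoded = "FieldA|FieldB|FieldC";  // 20 ASCII characters
        byte[] bytes = encoded.getBytes(StandardCharsets.UTF_8);
        System.out.println(bytes.length);  // prints 20: one byte per ASCII character
    }
}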

1 Answer


The documentation for StringSerializer states:

String encoding defaults to UTF8 and can be customized by setting the property key.serializer.encoding, value.serializer.encoding or serializer.encoding. The first two take precedence over the last.

So the default encoding is UTF-8, which is what you need.
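
If you want to verify this yourself, here is a minimal sketch; the topic name is arbitrary, since StringSerializer does not use it when encoding:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import org.apache.kafka.common.serialization.StringSerializer;

public class SerializerCheck {
    public static void main(String[] args) {
        StringSerializer serializer = new StringSerializer();
        byte[] fromSerializer = serializer.serialize("any-topic", "FieldA|FieldB|FieldC");
        byte[] fromGetBytes = "FieldA|FieldB|FieldC".getBytes(StandardCharsets.UTF_8);
        // Both arrays hold the same 20 UTF-8 bytes, so converting to byte[]
        // yourself and using ByteArraySerializer saves nothing on the wire
        System.out.println(Arrays.equals(fromSerializer, fromGetBytes)); // true
    }
}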

Also, you can download the sources and find:

private String encoding = "UTF8";

@Override
public void configure(Map<String, ?> configs, boolean isKey) {
    String propertyName = isKey ? "key.serializer.encoding" : "value.serializer.encoding";
    Object encodingValue = configs.get(propertyName);
    if (encodingValue == null)
        encodingValue = configs.get("serializer.encoding");
    if (encodingValue != null && encodingValue instanceof String)
        encoding = (String) encodingValue;
}

So the sources match the documentation, which is good.
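
You can also see the precedence rule from the snippet above in action by calling configure directly; a small sketch with made-up config values:

import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.common.serialization.StringSerializer;

public class PrecedenceCheck {
    public static void main(String[] args) {
        StringSerializer serializer = new StringSerializer();
        Map<String, Object> configs = new HashMap<>();
        configs.put("serializer.encoding", "US-ASCII");    // generic fallback
        configs.put("value.serializer.encoding", "UTF8");  // takes precedence for values
        serializer.configure(configs, false);              // isKey = false, so this configures a value serializer
        // The serializer now encodes values as UTF-8, not US-ASCII
    }
}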

If you want to be sure, you can explicitly set key.serializer.encoding and value.serializer.encoding to UTF8.
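
For example, a minimal producer configuration sketch (the bootstrap address is a placeholder):

import java.util.Properties;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("key.serializer.encoding", "UTF8");
props.put("value.serializer.encoding", "UTF8");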