I’m trying to find some information about the performance and the (dis)advantages of using two different Avro types for sending Kafka messages. According to my research, one can create an Avro-based Kafka message's payload as:
EITHER:
A GenericRecord, an instance of which can be created by calling new GenericData.Record and passing it a schema read from the Schema Registry:
Roughly:
private CachedSchemaRegistryClient schemaRegistryClient;
private Schema valueSchema;

// Read the value schema (subject "TestTopic-value", schema id 1) from the registry
this.valueSchema = schemaRegistryClient.getBySubjectAndId("TestTopic-value", 1);

// Define a generic record according to the loaded schema
GenericData.Record record = new GenericData.Record(valueSchema);
// Populate the fields declared in the schema, e.g. record.put("id", ...)

// Send to Kafka
ListenableFuture<SendResult<String, GenericRecord>> res;
res = avroKafkaTemplate
        .send(MessageBuilder
                .withPayload(record)
                .setHeader(KafkaHeaders.TOPIC, TOPIC)
                .setHeader(KafkaHeaders.MESSAGE_KEY, record.get("id"))
                .build());
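For completeness, here is a minimal, self-contained sketch of the GenericRecord part. The schema here is a hypothetical two-field example parsed locally, standing in for the one fetched from the registry:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;

public class GenericRecordSketch {
    // Hypothetical schema standing in for the one fetched from the Schema Registry
    private static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"TestValue\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"string\"},"
        + "{\"name\":\"amount\",\"type\":\"double\"}]}";

    public static void main(String[] args) {
        Schema valueSchema = new Schema.Parser().parse(SCHEMA_JSON);

        // Field names passed to put()/get() must match the schema exactly;
        // a typo only fails at runtime, not at compile time.
        GenericData.Record record = new GenericData.Record(valueSchema);
        record.put("id", "order-42");
        record.put("amount", 19.99);

        System.out.println(record.get("id")); // prints order-42
    }
}
```

This runtime-only field lookup is exactly the error-proneness I mention below.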
OR:
A class that extends SpecificRecordBase and is generated by the Avro Maven plugin (from a file containing an Avro schema):
// ...
public class MyClass extends org.apache.avro.specific.SpecificRecordBase
        implements org.apache.avro.specific.SpecificRecord {
// ...
MyClass myAvroClass = new MyClass();
// Populate the fields through the generated setters, e.g. myAvroClass.setId(...)

ListenableFuture<SendResult<String, MyClass>> res;
res = avroKafkaTemplate
        .send(MessageBuilder
                .withPayload(myAvroClass)
                .setHeader(KafkaHeaders.TOPIC, TOPIC)
                .setHeader(KafkaHeaders.MESSAGE_KEY, myAvroClass.getId())
                .build());
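To show what such a generated class boils down to, here is a minimal hand-written stand-in (not the actual Maven plugin output, just a sketch with a single hypothetical id field). It shows the two things that matter for my question: typed accessors, and a schema embedded in the class itself:

```java
import org.apache.avro.Schema;
import org.apache.avro.specific.SpecificRecordBase;

// Minimal hand-written stand-in for a Maven-generated Avro class
public class MyClassSketch extends SpecificRecordBase {
    public static final Schema SCHEMA$ = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"MyClass\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"string\"}]}");

    private CharSequence id;

    // The schema travels inside the class, which is what shows up in the debugger
    @Override public Schema getSchema() { return SCHEMA$; }

    @Override public Object get(int field) {
        if (field == 0) return id;
        throw new IndexOutOfBoundsException("Invalid index: " + field);
    }

    @Override public void put(int field, Object value) {
        if (field == 0) { id = (CharSequence) value; return; }
        throw new IndexOutOfBoundsException("Invalid index: " + field);
    }

    // Typed accessors instead of get("id") / put("id", ...)
    public CharSequence getId() { return id; }
    public void setId(CharSequence v) { id = v; }

    public static void main(String[] args) {
        MyClassSketch m = new MyClassSketch();
        m.setId("42");
        System.out.println(m.getId());               // prints 42
        System.out.println(m.getSchema().getName()); // prints MyClass
    }
}
```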
When debugging a piece of code that holds a GenericRecord instance, one can see that a schema is included in the object.
On that account I have a few questions:
1. If I send a GenericRecord instance to Kafka, is the underlying schema also sent?
2. If not, when is it dropped? Which class / method is responsible for extracting the bytes from a GenericRecord and dropping the underlying schema so that it is not sent together with the payload? And if the schema is sent, what is the point of the Schema Registry at all?
3. In the case of a class that extends SpecificRecord, the underlying schema is also sent, isn't it? That would mean that, if I took a function which receives a Kafka message and counts its bytes, I should expect more bytes in a specific-record message than in a generic-record message, right?
A SpecificRecord instance gives me more control, and its usage is less error-prone. If a schema is not sent with a GenericRecord but is sent with a SpecificRecord, then we have a trade-off. On the one hand (SpecificRecord), there is simplicity of usage, since a clear API is available: one doesn't have to know all the fields by heart and write get("X"), get("Y"), etc. On the other hand, the payload's size increases, since the schema has to be sent with it. If I have a relatively big schema (say, 50 fields), I should opt for sending GenericRecords with the help of the Schema Registry, otherwise the performance will be affected negatively because the schema has to be sent with every message, correct?
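To get a feel for the byte counting, I wrote a small sketch that serializes a GenericRecord with Avro's plain binary encoder (same hypothetical one-field schema as above; this is only the raw Avro encoding, not whatever a Kafka serializer may prepend). Notably, the schema text itself is not part of this output:

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class PayloadSizeSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical one-field schema, just for measuring
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"TestValue\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"string\"}]}");

        GenericData.Record record = new GenericData.Record(schema);
        record.put("id", "order-42");

        // Raw Avro binary encoding: just the field data, no schema text
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericData.Record>(schema).write(record, encoder);
        encoder.flush();

        // "order-42": 1 varint length byte + 8 UTF-8 bytes = 9
        System.out.println(out.toByteArray().length);
    }
}
```

The writer needs the schema to produce these bytes, but the bytes themselves contain only field values, which is why I'm unsure what exactly ends up on the wire in each of the two cases above.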