4 votes

We are writing a custom sink connector for writing the contents of a topic with Avro messages to CEPH storage.

To do this, we are provided with SinkRecords that carry a Kafka Connect schema, which is a mapped version of our Avro schema. Since we want to write Avro to CEPH, we use the Connect API methods to convert the Connect schema back to Avro. Why do we need to do this? What are the benefits of introducing the Kafka Connect Schema instead of using the more commonly adopted Avro schema?
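For reference, the conversion we do today looks roughly like the following. This is only a minimal sketch using Confluent's io.confluent.connect.avro.AvroData helper; the class name and cache size are illustrative, not our full implementation:

```java
import io.confluent.connect.avro.AvroData;
import org.apache.kafka.connect.sink.SinkRecord;

public class ConnectToAvroSketch {

    // AvroData translates between Connect schemas/values and Avro schemas/values.
    // The constructor argument is the schema cache size (value here is arbitrary).
    private final AvroData avroData = new AvroData(100);

    public Object toAvro(SinkRecord record) {
        // Convert the Connect schema back to an org.apache.avro.Schema ...
        org.apache.avro.Schema avroSchema = avroData.fromConnectSchema(record.valueSchema());
        // ... and the Connect value (e.g. a Struct) into an Avro object (e.g. a GenericRecord).
        Object avroValue = avroData.fromConnectData(record.valueSchema(), record.value());
        System.out.println("Writer schema: " + avroSchema);
        return avroValue;
    }
}
```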

FYI: I am asking this because we have some issues with Avro unions. Their mappings to the Kafka Connect Schema still have some issues, e.g. https://github.com/confluentinc/schema-registry/commit/31648f0d34b10c1b36e8ec6f4c1236ed3fe86495#diff-0a8d4f17f8d4a68f2f0d2dcd9211df84


1 Answer

8 votes

Kafka Connect defines its own schema structure because the framework isolates connectors from any knowledge of how messages are serialized in Kafka. This makes it possible to use any connector with any Converter. Without this separation, connectors would expect messages to be serialized in a particular form, making them harder to reuse.
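You can see that isolation in the Converter contract itself. Below is a rough outline of org.apache.kafka.connect.storage.Converter (simplified; newer Kafka versions also add header-aware default methods not shown here). Connectors only ever see the Schema and value objects on the Connect side of these methods, while the Converter alone touches serialized bytes:

```java
import java.util.Map;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaAndValue;

// Rough outline of the Converter contract: the framework calls these methods
// at the boundary, so connectors never deal with raw bytes themselves.
public interface Converter {

    void configure(Map<String, ?> configs, boolean isKey);

    // Source side: Connect schema + value -> serialized bytes written to Kafka.
    byte[] fromConnectData(String topic, Schema schema, Object value);

    // Sink side: serialized bytes read from Kafka -> Connect schema + value.
    SchemaAndValue toConnectData(String topic, byte[] value);
}
```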

If you know that all messages are serialized with a specific Avro schema, you can always configure your sink connector to use the ByteArrayConverter for keys and values, and then your connector can deal with the messages in the serialized form.
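With that setup, the sink task receives the raw bytes directly. A minimal sketch of what that looks like (the class name and the writeToCeph method are placeholders, not part of any real API):

```java
import java.util.Collection;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

// With these settings in the connector configuration:
//   key.converter=org.apache.kafka.connect.converters.ByteArrayConverter
//   value.converter=org.apache.kafka.connect.converters.ByteArrayConverter
// the records passed to put() carry the untouched Kafka message bytes.
public abstract class RawBytesCephSinkTask extends SinkTask {

    @Override
    public void put(Collection<SinkRecord> records) {
        for (SinkRecord record : records) {
            // valueSchema() is a plain bytes schema and value() is the raw byte[].
            byte[] rawValue = (byte[]) record.value();
            writeToCeph(record.topic(), record.kafkaPartition(), record.kafkaOffset(), rawValue);
        }
    }

    // Placeholder for whatever CEPH client call the connector actually uses.
    protected abstract void writeToCeph(String topic, Integer partition, long offset, byte[] value);
}
```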

However, be aware that if the messages are serialized using Confluent's Avro serializer (or the Avro Converter in a source connector), then the binary form of the keys and values will include the magic byte and the Avro schema identifier in the leading bytes. The remaining content of each byte array is the Avro-serialized form.
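So before handing the bytes to anything that expects plain Avro, you would strip that header. A small sketch assuming the standard Confluent wire format (one magic byte followed by a 4-byte, big-endian schema ID):

```java
import java.nio.ByteBuffer;

// Splits a Confluent-serialized message into the schema ID and the Avro payload.
public final class WireFormat {

    private static final byte MAGIC_BYTE = 0x0;
    private static final int HEADER_SIZE = 1 + 4; // magic byte + 4-byte schema ID

    public static int schemaId(byte[] serialized) {
        ByteBuffer buffer = ByteBuffer.wrap(serialized);
        if (buffer.get() != MAGIC_BYTE) {
            throw new IllegalArgumentException("Not in Confluent wire format");
        }
        return buffer.getInt(); // big-endian schema ID registered in Schema Registry
    }

    public static byte[] avroPayload(byte[] serialized) {
        byte[] payload = new byte[serialized.length - HEADER_SIZE];
        System.arraycopy(serialized, HEADER_SIZE, payload, 0, payload.length);
        return payload;
    }
}
```

The schema ID can be used to look up the writer schema in the Schema Registry if you need it, and the remaining payload can be written to CEPH or decoded with plain Avro tooling.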