
I have been wondering about the need for an Avro Schema Registry when consuming messages from a Kafka topic using a statically typed language like Java. I'm consuming messages from a Kafka topic with a consumer set up like this:

    Properties props = new Properties();
    props.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, String.join(",", kafkaProperties.getServers()));
    props.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
    props.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, KafkaAvroDeserializer.class.getName());

    props.setProperty(KafkaAvroDeserializerConfig.SCHEMA_REGISTRY_URL_CONFIG, kafkaProperties.getSchemaRegistryUrl());
    KafkaConsumer<byte[], FooClass> kafkaConsumer = new KafkaConsumer<>(props);

In my project I have .avsc files that define the schema for the FooClass class. I have also configured the avro-maven-plugin to generate FooClass for me at build time.

Why do I still need to specify a Schema Registry URL? Isn't my consumer able to deserialize the values of my Kafka messages using the .avsc file in my project?


3 Answers

1 vote

You're using Confluent libraries (io.confluent.kafka.serializers.KafkaAvroDeserializer) which define their own Confluent Avro format and mandate the use of the Confluent Schema Registry.

Technically, you don't need a registry for Apache Avro.

Avro needs the writer's schema to decode a message, and while that schema is embedded in Avro files (making them self-describing), it is not included in the streaming format or in Confluent Avro.
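
As a rough illustration (not code from the question; recordValueBytes stands in for the raw bytes of a record value), this is all a Confluent-Avro-encoded message carries:

    import java.nio.ByteBuffer;

    // Confluent wire format: one magic byte (0), a 4-byte schema id, then the
    // Avro binary payload. The writer schema itself is not in the message.
    ByteBuffer buf = ByteBuffer.wrap(recordValueBytes);
    byte magic = buf.get();        // always 0 for the Confluent format
    int schemaId = buf.getInt();   // id used to fetch the writer schema from the registry
    // remaining bytes: the Avro payload, decodable only once the writer schema is known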

So, the client needs some way to look up the schema. For the Confluent Avro format this is solved by the Confluent Schema Registry; for the Apache Avro format it can be solved by your own org.apache.avro.message.SchemaStore. See this example, where I use a SchemaStore.Cache pre-filled with known schemata.
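
A minimal sketch of that approach (assuming the question's generated FooClass, a ByteArrayDeserializer for the record value, and a writer schema already known at build time, here called knownWriterSchema) could look like this:

    import org.apache.avro.message.BinaryMessageDecoder;
    import org.apache.avro.message.SchemaStore;
    import org.apache.avro.specific.SpecificData;

    // Pre-fill a local schema store with every writer schema known at build time.
    SchemaStore.Cache schemaStore = new SchemaStore.Cache();
    schemaStore.addSchema(knownWriterSchema); // e.g. parsed from a bundled .avsc file

    // Decoder for Avro's single-object encoding; the generated class supplies the reader schema.
    BinaryMessageDecoder<FooClass> decoder =
            new BinaryMessageDecoder<>(SpecificData.get(), FooClass.getClassSchema(), schemaStore);

    FooClass value = decoder.decode(record.value()); // throws IOException; record value is byte[]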

Note that the example uses the Apache Avro format, which is incompatible with Confluent Avro.

The Confluent Avro deserializer needs a Confluent Schema Registry and has no API for "run with known schemata".

0 votes

The purpose of Schema Registry is to make schemas available to all producers and consumers, without them needing to be tied together by the distribution and management of something like an .avsc file. A file like this is fine within a standalone project, but Kafka is frequently used by multiple applications, perhaps across teams or even organisational units - so being able to share a schema in a more loosely coupled way is important.

Ref: https://docs.confluent.io/current/schema-registry/index.html

0 votes

After learning more about the Avro format and the role of a schema registry, I realized why a schema registry is needed even for a statically typed language like Java. The short answer is "schema evolution".

So let's say you build an app today that consumes messages of type A, written with schema SA. At build time you may have an "a.avsc" file that you use to generate the classes you deserialize messages into. At this point there seems to be no need to contact a schema registry to get SA; it would make sense to point your deserializer at the "a.avsc" file you built your application with. But the Avro deserializer doesn't let you do that (it needs a registry), which makes you wonder: why?

A week later, the producer of type-A messages decides to add a new field to A. When this happens, an Avro deserializer that only had the schema you built your application with (if that were possible) would not be able to deserialize these new messages when it encounters one. At the same time, the code generated from the old schema still works (as long as the schema change is backward compatible). But for your deserializer to read messages written with the new schema, it needs that new schema.

In effect, your Java application code generated from the old schema will still work with messages written using an evolved schema, but without the new schema (provided by the registry) your Avro deserializer cannot decode those new messages.
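
For completeness, here is a sketch of how this plays out with the Confluent deserializer (the bootstrap servers and registry URL below are placeholders, and FooClass is the class generated from your old schema):

    // The generated FooClass (old schema) acts as the reader schema. The writer
    // schema of each message is fetched from the registry via the schema id
    // embedded in the message, and Avro schema resolution bridges the two.
    Properties props = new Properties();
    props.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.setProperty(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
    props.setProperty(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, KafkaAvroDeserializer.class.getName());
    props.setProperty(KafkaAvroDeserializerConfig.SCHEMA_REGISTRY_URL_CONFIG, "http://localhost:8081");
    props.setProperty(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, "true"); // deserialize into FooClass
    KafkaConsumer<byte[], FooClass> consumer = new KafkaConsumer<>(props);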

So, in theory, if schemas never changed you would be able to get away with providing the schema at build time.