
We are using Apache Kafka for communication between microservices. We have an Avro schema that needs to be expanded by adding a new field. In order to stay compatible, we have to give the new field a default value. However, from the application-logic perspective, a default value makes no sense. New consumers need to be able to consume only the schema in which the new field is required from the start.

We would like to deprecate the old schema but still leave a short period of time for old consumers to adjust, while having new consumers ignore messages without the required field. I have read that Kafka generally allows two completely separate schemas to be sent to the same topic, with consumers choosing which schema to consume, but I guess this is not possible if you're using Avro? Is there a way to have those two incompatible schemas on the same topic for a time, without adding any default values? The old schema would be removed after a while, and no producers would produce it once the deprecation period passes.

A workaround would be to use separate topics, but I would like to avoid that if possible.

EDIT: Possibly I was not clear enough with the question.

Is it possible to have two completely unconnected schemas, with no matching fields, on the same topic?

For example, let's say user & purchase tracking. Service-specific application logic would handle all the implications of such a system. Some services do not care about users and only record purchases for financial reporting; others just track users, maybe for some kind of user management or notifications; a third group cares about both types and needs to know for sure that a user existed and was active at the time they wanted to make a purchase. Here I would need two loosely related schemas (related only in the sense that a purchase needs a foreign key to the user, but this is an application-logic relation, not a schema relation). Obviously these are two separate objects, and no defaults would make sense just to keep the schemas compatible. Each microservice would decide which schemas to consume from the one topic, but the order of messages would still be crucial.

Further EDIT:

After further research I believe io.confluent.kafka.serializers.subject.TopicRecordNameStrategy should do the trick, as OneCricketeer suggested; I just need to figure out how to configure it.

This is a relevant section of my application.yml so far:

spring:
...
  cloud:
    stream:
      schemaRegistryClient:
        endpoint: http://localhost:8081
      default:
        producer:
          useNativeEncoding: true
        consumer:
          useNativeEncoding: true
      kafka:
        binder:
          brokers: localhost:9092
          producer-properties:
            key.serializer: io.confluent.kafka.serializers.KafkaAvroSerializer
            value.serializer: io.confluent.kafka.serializers.KafkaAvroSerializer
            schema.registry.url: http://localhost:8081
          consumer-properties:
            key.deserializer: io.confluent.kafka.serializers.KafkaAvroDeserializer
            value.deserializer: io.confluent.kafka.serializers.KafkaAvroDeserializer
            schema.registry.url: http://localhost:8081
            specific.avro.reader: true
      bindings:
        foo:
          destination: foo
          content-type: application/*+avro
      ...

So where do I set the subject strategy?

You could disable backwards compatibility completely, or set it to forward compatibility... Then you don't need a default. - OneCricketeer
This carries a different kind of risk, and we would like to avoid having to do that. - Ajant

1 Answer


I have read that Kafka generally allows two completely separate schemas to be sent to the same topic, with consumers choosing which schema to consume, but I guess this is not possible if you're using Avro?

I think you're referring to the subject-name Strategy options in the serializer classes here. Those are definitely applicable to Avro, but I don't think they will solve the problem you mention unless you had started with the RecordName option to begin with.
This would also mean you'd need two separate classes (the second with the required field) rather than modifying one class to add a field, and I think that would leave you unable to consume the "original" records, because the consumer would only know about the schema of the newer record type. There might be ways to manually register the schema to work around that, though.
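
For what it's worth, here is a minimal sketch of where that setting lives with the plain Confluent serializer configuration. value.subject.name.strategy is the serializer's own property, so under Spring Cloud Stream it should presumably go in the binder's producer-properties and consumer-properties, next to schema.registry.url (the topic and class names below are illustrative):

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

public class StrategyConfigSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");
        // Subjects become "<topic>-<record's fully-qualified name>" instead of
        // the default "<topic>-value", so multiple record types can share a topic:
        props.put("value.subject.name.strategy",
                "io.confluent.kafka.serializers.subject.TopicRecordNameStrategy");

        KafkaProducer<String, Object> producer = new KafkaProducer<>(props);
        producer.close();
    }
}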

The registry otherwise has no concept of deprecation policies or schema TTLs. If you want to force upgrades, you can delete the schema, and clients will then start getting 404 errors.
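
For example, a sketch using the Java registry client (the foo-value subject name assumes the default TopicNameStrategy; the same operation is also exposed over the registry's REST API):

import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient;

public class DeleteSubjectSketch {
    public static void main(String[] args) throws Exception {
        SchemaRegistryClient client =
                new CachedSchemaRegistryClient("http://localhost:8081", 10);
        // Deletes all versions registered under the subject; anything still
        // referring to it will start seeing 404s from the registry.
        System.out.println(client.deleteSubject("foo-value"));
    }
}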

New consumers need to be able to consume only the schema in which the new field is required from the start

The point of having defaults is that consumers can decide what to do with records that never had that field - for example, if your consumer group got wiped and you started "from the start" of the topic.
Defaults aren't applied and sent as part of producer serialization, but I cannot recall whether that means the field must be set when building the record or not.
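
On that last point, if memory serves, the Java GenericRecordBuilder fills in a declared default for any field you don't set, and refuses to build a record whose field has neither a value nor a default. A quick sketch with a hypothetical schema:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;

public class DefaultsSketch {
    public static void main(String[] args) {
        // Hypothetical schema: "name" has no default, "age" defaults to -1.
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\",\"default\":-1}]}");

        // "age" is not set explicitly, so the builder applies the default:
        GenericRecord user = new GenericRecordBuilder(schema)
                .set("name", "alice")
                .build();
        System.out.println(user); // {"name": "alice", "age": -1}

        // Omitting "name" (no default) would make build() throw:
        // new GenericRecordBuilder(schema).build(); // AvroRuntimeException
    }
}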

So, there are two ways I see to address the problem:

  1. Set an impossible default value and check for it in the consumer, skipping the record or throwing an exception any time you see it (see the sketch below)
  2. Empty the topic, or use a new one, so that it's not possible for any record there to refer to a schema with missing fields - meaning all producers need to be upgraded first
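
For option 1, a minimal consumer-side sketch; the userId field, its -1 sentinel, and the foo topic are illustrative, not taken from your setup:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SentinelCheckConsumer {
    private static final long MISSING_USER_ID = -1L; // the "impossible" default

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "purchase-consumer");
        props.put("key.deserializer",
                "io.confluent.kafka.serializers.KafkaAvroDeserializer");
        props.put("value.deserializer",
                "io.confluent.kafka.serializers.KafkaAvroDeserializer");
        props.put("schema.registry.url", "http://localhost:8081");

        try (KafkaConsumer<String, GenericRecord> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("foo"));
            while (true) {
                for (ConsumerRecord<String, GenericRecord> record :
                        consumer.poll(Duration.ofSeconds(1))) {
                    long userId = (Long) record.value().get("userId");
                    if (userId == MISSING_USER_ID) {
                        // Record predates the new field; skip it (or throw).
                        continue;
                    }
                    // ... normal processing ...
                }
            }
        }
    }
}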