9 votes

I have a Spring application that is my Kafka producer, and I was wondering why Avro is the best way to go. I read about it and all it has to offer, but why can't I just serialize the POJO that I created myself with Jackson, for example, and send it to Kafka?

I'm saying this because the POJO generation from Avro is not so straightforward. On top of that, it requires the Maven plugin and an .avsc file.

So, for example, I have a POJO called User on my Kafka producer that I created myself:

public class User {

    private long    userId;

    private String  name;

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public long getUserId() {
        return userId;
    }

    public void setUserId(long userId) {
        this.userId = userId;
    }

}

I serialize it and send it to my user topic in Kafka. Then I have a consumer that has its own User POJO and deserializes the message. Is it a matter of space? Is it also not faster to serialize and deserialize this way? Not to mention the overhead of maintaining a Schema Registry.
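To make this concrete, the Jackson approach I have in mind is just a plain JSON serializer like the following (a sketch; the class name is mine):

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.serialization.Serializer;

// Plain-JSON Kafka serializer for the User POJO above: no schema,
// no code generation, just Jackson turning the object into bytes.
public class UserJsonSerializer implements Serializer<User> {

    private final ObjectMapper mapper = new ObjectMapper();

    @Override
    public byte[] serialize(String topic, User user) {
        try {
            return mapper.writeValueAsBytes(user); // raw JSON bytes, no schema attached
        } catch (Exception e) {
            throw new RuntimeException("JSON serialization failed", e);
        }
    }
}
```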


3 Answers

9 votes

You don't need an AVSC file; you can use an AVDL file, which basically looks the same as a POJO with only the fields:

@namespace("com.example.mycode.avro")
protocol ExampleProtocol {
   record User {
     long id;
     string name;
   }
}

When run with the idl-protocol goal of the Avro Maven plugin, this will generate the following AVSC for you, rather than you writing it yourself:

{
  "type" : "record",
  "name" : "User",
  "namespace" : "com.example.mycode.avro",
  "fields" : [ {
    "name" : "id",
    "type" : "long"
  }, {
    "name" : "name",
    "type" : "string"
  } ]
}

And it will also place a SpecificData POJO, User.java, on your classpath for use in your code.


If you already have a POJO, you don't need AVSC or AVDL files. There are libraries to convert POJOs. For example, Jackson is not only for JSON; you would likely just need to create a JacksonAvroSerializer for Kafka, or find out if one exists.
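As a sketch of that idea, the jackson-dataformat-avro module can derive an Avro schema directly from the existing POJO (assuming the User class from the question; the serializer class name is illustrative):

```java
import com.fasterxml.jackson.dataformat.avro.AvroMapper;
import com.fasterxml.jackson.dataformat.avro.AvroSchema;
import org.apache.kafka.common.serialization.Serializer;

// Kafka Serializer built on jackson-dataformat-avro: the Avro schema
// is inferred from the POJO, so no .avsc/.avdl file is required.
public class JacksonAvroSerializer implements Serializer<User> {

    private final AvroMapper mapper = new AvroMapper();
    private final AvroSchema schema;

    public JacksonAvroSerializer() {
        try {
            this.schema = mapper.schemaFor(User.class); // schema derived from the POJO
        } catch (Exception e) {
            throw new IllegalStateException("Could not derive Avro schema", e);
        }
    }

    @Override
    public byte[] serialize(String topic, User user) {
        try {
            return mapper.writer(schema).writeValueAsBytes(user); // Avro binary, not JSON
        } catch (Exception e) {
            throw new RuntimeException("Avro serialization failed", e);
        }
    }
}
```

Note that this writes the bare Avro binary; it does not talk to a Schema Registry the way Confluent's KafkaAvroSerializer does.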

Avro also has a built-in library based on reflection.
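A minimal sketch of that reflection approach, using Avro's ReflectData with the User POJO from the question (the helper class name is mine):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.reflect.ReflectData;
import org.apache.avro.reflect.ReflectDatumWriter;

// ReflectData builds the Avro schema from the POJO at runtime,
// so no generated class and no schema file are needed.
public class ReflectAvroExample {

    public static byte[] toAvro(User user) throws IOException {
        Schema schema = ReflectData.get().getSchema(User.class);
        ReflectDatumWriter<User> writer = new ReflectDatumWriter<>(schema);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(user, encoder); // Avro binary, schema not included in the payload
        encoder.flush();
        return out.toByteArray();
    }
}
```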


So, to the question: why Avro (for Kafka)?

Well, having a schema is a good thing. Think about RDBMS tables: you can explain the table, and you see all the columns. Move to NoSQL document databases, and they can contain literally anything, and that is the JSON world of Kafka.

Let's assume you have consumers in your Kafka cluster that have no idea what is in a topic; they have to know exactly who/what has produced into it. They can try the console consumer, and if it is plaintext like JSON, then they have to figure out which fields they are interested in, then perform flaky HashMap-like .get("name") operations again and again, only to run into an NPE when a field doesn't exist. With Avro, you clearly define defaults and nullable fields.

You aren't required to use a Schema Registry, but it provides that "explain topic" semantics for the RDBMS analogy. It also saves you from needing to send the schema along with every message, and the expense of extra bandwidth on the Kafka topic. The registry is not only useful for Kafka, though: it can also be used by Spark, Flink, Hive, etc. for all the data-science analysis surrounding streaming data ingest.


Assuming you did want to use JSON, then try using MsgPack instead, and you will likely see an increase in your Kafka throughput and save disk space on the brokers.


You can also use other formats like Protobuf or Thrift, as Uber has compared.

4 votes

It is a matter of speed and storage. When serializing data, you often need to transmit the actual schema, and this therefore causes an increase in payload size.

                            Total Payload Size
+-----------------+--------------------------------------------------+
|     Schema      |                 Serialised Data                  |
+-----------------+--------------------------------------------------+

Schema Registry provides a centralized repository for schemas and their metadata, so that all schemas are registered in a central system. This centralized system enables producers to include only the ID of the schema, instead of the full schema itself (in text format).

                      Total Payload Size
+----+--------------------------------------------------+
| ID |                 Serialised Data                  |
+----+--------------------------------------------------+

Therefore, the payload is smaller, and serialisation becomes faster.
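Confluent's serializers implement this by prepending a single "magic" byte and a 4-byte big-endian schema ID to the serialised data, so the per-message overhead is only 5 bytes regardless of how large the schema is. A minimal sketch of that framing (the class name is illustrative):

```java
import java.nio.ByteBuffer;

// Sketch of the Confluent wire format: each record carries a 1-byte
// magic marker and a 4-byte schema ID instead of the full schema text.
public class WireFormat {

    private static final byte MAGIC_BYTE = 0x0;

    public static byte[] frame(int schemaId, byte[] serialisedData) {
        return ByteBuffer.allocate(1 + 4 + serialisedData.length)
                .put(MAGIC_BYTE)
                .putInt(schemaId)       // big-endian schema ID from the registry
                .put(serialisedData)    // the Avro binary payload
                .array();
    }
}
```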

Furthermore, Schema Registry versioning enables the enforcement of data policies that can help prevent newer schemas from breaking compatibility with existing versions, which could cause downtime or other significant issues in your pipeline.


Some more benefits of Schema Registry are thoroughly explained in this article by Confluent.

1 vote

First of all, Kafka has no idea about the key/value content. It operates on bytes, and it is the client's (producer/consumer) responsibility to take care of de/serialization.

The most common options so far seem to be JSON, protobuf and Avro.

What I personally like about Avro, and why I usually use it and recommend it to others:

1) It is a reasonably compact binary serialization, with a schema and logical types (which help distinguish a regular long from a timestamp in long millis).

2) Avro schemas are very descriptive and perfectly documented.

3) Wide support among most widely used programming languages is a must!

4) Confluent (and others) provide a repository for schemas, the so-called "schema registry", giving you centralized storage for your schemas. In Avro, the message contains just the schema version ID, not the schema itself.

5) If you are using Java, you can benefit greatly from the POJO base-class generation from the schema.

Sure, you can get parts of these with other options. You should try and compare all the options that suit your use case.

P.S. My very personal, opinionated advice: if it's not a String, go for Avro. This applies to both keys and values.