I have installed Kafka locally (no cluster or Schema Registry for now) and am trying to produce Avro messages to a topic. Below is the schema associated with that topic.
{
  "type" : "record",
  "name" : "Customer",
  "namespace" : "com.example.Customer",
  "doc" : "Class: Customer",
  "fields" : [ {
    "name" : "name",
    "type" : "string",
    "doc" : "Variable: Customer Name"
  }, {
    "name" : "salary",
    "type" : "double",
    "doc" : "Variable: Customer Salary"
  } ]
}
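For reference, a Scala case class mirroring this schema could look like the sketch below (my own naming, not generated from the schema):

// Mirrors the Avro record: name (string) and salary (double)
case class Customer(name: String, salary: Double)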
I would like to write a simple Spark producer application that creates some data based on the above schema and publishes it to Kafka.
I am thinking of creating sample data, converting it to a DataFrame, serializing it to Avro, and then publishing it.
val df = spark.createDataFrame(<<data>>)
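For example, a few hard-coded rows could be turned into a DataFrame along these lines (a rough sketch, assuming an existing SparkSession named spark):

import spark.implicits._

// A couple of sample rows matching the Customer schema
val df = Seq(
  ("John", 55000.0),
  ("Jane", 72000.0)
).toDF("name", "salary")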
And then, something like below:
// df must expose a binary (or string) 'value' column for the Kafka sink
df.write
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "customer_avro_topic")
  .save()
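The missing piece is turning the rows into that Avro-encoded value column. My understanding is that the spark-avro package's to_avro function can do this, so the whole batch job would look roughly like the sketch below (assuming Spark 3.x, where to_avro lives in org.apache.spark.sql.avro.functions; I have not verified this end to end):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.struct
import org.apache.spark.sql.avro.functions.to_avro

val spark = SparkSession.builder()
  .appName("CustomerAvroProducer")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Sample rows matching the Customer schema
val df = Seq(("John", 55000.0), ("Jane", 72000.0)).toDF("name", "salary")

// Pack all columns into a struct and Avro-encode it into the single
// binary 'value' column that the Kafka sink expects
val avroDf = df.select(to_avro(struct($"name", $"salary")).alias("value"))

avroDf.write
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "customer_avro_topic")
  .save()

From the docs it also looks like to_avro can take the Avro schema JSON as a second argument, which might be the way to keep the record name/namespace/doc from the schema above, but I have not tried that.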
Attaching the schema to this Avro topic can be done manually for now.
Can this be done using only Apache Spark APIs, without the plain Java/Kafka producer APIs? This is for batch processing rather than streaming.
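My assumption is that this only needs the Kafka and Avro data source packages on the classpath, roughly as below (artifact names as published by Apache Spark; the version is a placeholder and would have to match the local Spark install):

// build.sbt fragment; sparkVersion is a placeholder
val sparkVersion = "3.3.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"            % sparkVersion,
  "org.apache.spark" %% "spark-sql-kafka-0-10" % sparkVersion,
  "org.apache.spark" %% "spark-avro"           % sparkVersion
)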