
I am building a Spark Structured Streaming application in which I have a DataFrame with 1000s of rows. When I call writeStream().format("kafka"), it makes 1000s of calls to Kafka, transferring each row individually.
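
The write currently looks roughly like the sketch below (the namespace, event hub name, connection string, and the assumption that the payload sits in a string column named value are all placeholders):

    // df is the streaming Dataset<Row> described above; each row becomes one Kafka record.
    df.selectExpr("CAST(value AS STRING) AS value")
      .writeStream()
      .format("kafka")
      .option("kafka.bootstrap.servers", "<namespace>.servicebus.windows.net:9093")
      .option("kafka.security.protocol", "SASL_SSL")
      .option("kafka.sasl.mechanism", "PLAIN")
      .option("kafka.sasl.jaas.config",
          "org.apache.kafka.common.security.plain.PlainLoginModule required "
          + "username=\"$ConnectionString\" password=\"<connection-string>\";")
      .option("topic", "<event-hub-name>")
      .option("checkpointLocation", "/tmp/checkpoints/kafka-sink")
      .start();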

Is it possible to batch those 1000 messages and make a single call to Event Hubs (enabled with the Kafka endpoint)?

I have tried the code below, using the Event Hubs client library and its createBatch() method. But it works on individual string entries, so to use that library I would need to either call foreachBatch or collect() on the DataFrame; a rough foreachBatch sketch of that idea follows the snippet.

    EventDataBatch eventDataBatch = eventHubClient.createBatch();
    for (String event : events) {
        EventData eventData = EventData.create(event.getBytes(StandardCharsets.UTF_8));
        boolean addedToBatch = eventDataBatch.tryAdd(eventData); // returns false once the batch is full
    }
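
The foreachBatch version of that idea could look roughly like this (a sketch only, assuming the legacy com.microsoft.azure.eventhubs client referenced above as eventHubClient, a string column named value, and micro-batches small enough to collect to the driver):

    import java.nio.charset.StandardCharsets;
    import com.microsoft.azure.eventhubs.EventData;
    import com.microsoft.azure.eventhubs.EventDataBatch;
    import org.apache.spark.api.java.function.VoidFunction2;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    df.writeStream()
      .foreachBatch((VoidFunction2<Dataset<Row>, Long>) (batchDf, batchId) -> {
          // foreachBatch runs on the driver once per micro-batch, so the client can be reused here.
          EventDataBatch eventDataBatch = eventHubClient.createBatch();
          for (Row row : batchDf.select("value").collectAsList()) {
              EventData eventData = EventData.create(row.getString(0).getBytes(StandardCharsets.UTF_8));
              if (!eventDataBatch.tryAdd(eventData)) {
                  // The batch hit its size limit: send it and start a new one for this event.
                  eventHubClient.sendSync(eventDataBatch);
                  eventDataBatch = eventHubClient.createBatch();
                  eventDataBatch.tryAdd(eventData);
              }
          }
          if (eventDataBatch.getSize() > 0) {
              eventHubClient.sendSync(eventDataBatch); // flush the remaining events
          }
      })
      .option("checkpointLocation", "/tmp/checkpoints/eventhubs-foreachbatch")
      .start();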

Is there a better solution for batching the data and sending it to Event Hubs (with the Kafka endpoint) from a Spark Structured Streaming application?


1 Answer


I am not 100% sure how the Kafka connector behaves in this scenario.

However, if you are willing to use the Java/Scala SDK instead, you can write data from a DataFrame to Event Hubs using write or writeStream with the azure-event-hubs-spark connector: https://github.com/Azure/azure-event-hubs-spark/blob/master/docs/structured-streaming-eventhubs-integration.md#writing-data-to-eventhubs
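
For reference, the write example from that page looks roughly like this when translated to Java (a sketch only; the docs show Scala, the connection values are placeholders, and exact Java-facing signatures such as EventHubsConf.toMap may vary between connector versions):

    import org.apache.spark.eventhubs.ConnectionStringBuilder;
    import org.apache.spark.eventhubs.EventHubsConf;

    String connectionString = ConnectionStringBuilder.apply("<event hub connection string from the Azure portal>")
        .setEventHubName("<event hub name>")
        .build();
    EventHubsConf ehConf = EventHubsConf.apply(connectionString);

    // Per the linked docs, the connector reads the payload from a column named "body"
    // and distributes events across partitions round-robin unless a partition key/id is set.
    df.select("body")
      .writeStream()
      .format("eventhubs")
      .options(ehConf.toMap())
      .option("checkpointLocation", "/tmp/checkpoints/eventhubs-sink")
      .start();

This route should also reduce the number of calls compared with one request per row, since the connector groups events before sending them to the service.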