I currently have a Spring Batch job with a single step that reads data from Oracle, passes it through multiple Spring Batch processors (CompositeItemProcessor), and writes it to different destinations such as Oracle and files (CompositeItemWriter):
<batch:step id="dataTransformationJob">
    <batch:tasklet transaction-manager="transactionManager" task-executor="taskExecutor" throttle-limit="30">
        <batch:chunk reader="dataReader" processor="compositeDataProcessor" writer="compositeItemWriter" commit-interval="100"/>
    </batch:tasklet>
</batch:step>
In the above step, the compositeItemWriter is configured with two writers that run one after the other and write 100 million records to both Oracle and a file. The dataReader has a synchronized read method to ensure that multiple threads don't read the same rows from Oracle. The job currently takes about 1 hour 30 minutes to complete.
I am planning to break the above job into two parts: the reader/processors will produce data on two Kafka topics (one for data destined for Oracle, the other for data destined for a file). On the other side, I will have a job with two parallel flows that read from each topic and write the data to Oracle and a file respectively.
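To show what I mean on the producer side: I would keep the existing CompositeItemWriter but swap its delegates for two KafkaItemWriters (available since Spring Batch 4.2), one per topic. The bean names, topic wiring, and the `Record` type below are my own placeholders; KafkaItemWriter publishes via `sendDefault`, so each KafkaTemplate would need its default topic set.

```java
import java.util.Arrays;

import org.springframework.batch.item.kafka.KafkaItemWriter;
import org.springframework.batch.item.kafka.builder.KafkaItemWriterBuilder;
import org.springframework.batch.item.support.CompositeItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.kafka.core.KafkaTemplate;

public class ProducerSideConfig {

    // Sketch only: replaces the current Oracle/file delegates with two Kafka writers.
    @Bean
    public CompositeItemWriter<Record> kafkaCompositeWriter(
            KafkaTemplate<String, Record> oracleTopicTemplate, // defaultTopic = the "Oracle" topic
            KafkaTemplate<String, Record> fileTopicTemplate) { // defaultTopic = the "file" topic

        KafkaItemWriter<String, Record> oracleTopicWriter = new KafkaItemWriterBuilder<String, Record>()
                .kafkaTemplate(oracleTopicTemplate)
                .itemKeyMapper(Record::getId) // message key; also determines the target partition
                .build();

        KafkaItemWriter<String, Record> fileTopicWriter = new KafkaItemWriterBuilder<String, Record>()
                .kafkaTemplate(fileTopicTemplate)
                .itemKeyMapper(Record::getId)
                .build();

        CompositeItemWriter<Record> composite = new CompositeItemWriter<>();
        composite.setDelegates(Arrays.asList(oracleTopicWriter, fileTopicWriter));
        return composite;
    }
}
```

This is a configuration sketch, not a tested implementation.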
With the above architecture in mind, I wanted to understand how to refactor a Spring Batch job to use Kafka. I believe these are the areas I would need to address:
- In the existing job that doesn't use Kafka, my throttle limit is 30; however, when I use Kafka in the middle, how does one decide the right throttle-limit?
- In the existing job I have a commit-interval of 100. This means that the CompositeItemWriter will be called once for every 100 records, and each delegate writer will unpack the chunk and call its write method on it. Does this mean that when I write to Kafka, there will be 100 publish calls to Kafka?
- Is there a way to club multiple rows into one single message in Kafka to avoid multiple network calls?
- On the consumer side, I want a Spring Batch multi-threaded step that can read each partition of a topic in parallel. Does Spring Batch have built-in classes to support this already?
- The consumer will use a standard JdbcBatchItemWriter or FlatFileItemWriter to write the data read from Kafka, so I believe this part should just be standard Spring Batch.
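On the consumer-side question, my research so far suggests Spring Batch 4.2+ ships `org.springframework.batch.item.kafka.KafkaItemReader`, and that one way to read partitions in parallel is a partitioned step that launches one step-scoped reader per partition. Below is how I would expect to configure it; the topic name, group id, bootstrap address, and the `Record`/`RecordDeserializer` types are placeholders, so please correct me if this is not the intended usage.

```java
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.kafka.KafkaItemReader;
import org.springframework.batch.item.kafka.builder.KafkaItemReaderBuilder;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;

public class ConsumerSideConfig {

    // Sketch only: one reader instance per topic partition, driven by a partitioned step
    // that puts the partition number into each step's ExecutionContext.
    @Bean
    @StepScope
    public KafkaItemReader<String, Record> kafkaReader(
            @Value("#{stepExecutionContext['partition']}") Integer partition) {

        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "oracle-writer-job");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, RecordDeserializer.class.getName());

        return new KafkaItemReaderBuilder<String, Record>()
                .name("kafkaReader." + partition)
                .topic("oracle-topic")
                .partitions(partition)           // this reader instance owns one partition
                .consumerProperties(props)
                .pollTimeout(Duration.ofSeconds(10))
                .saveState(true)                 // offsets kept in the ExecutionContext for restart
                .build();
    }
}
```

Again, this is a configuration sketch based on my reading of the docs, not tested code.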
Note: I am aware of Kafka Connect but don't want to use it because it requires setting up a Connect cluster, and I don't have the infrastructure available to support that.
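To make the "club multiple rows into one message" idea concrete, this is the kind of aggregation I have in mind before publishing (plain Java, all names mine): group each 100-record chunk into a handful of newline-delimited payloads, so each payload becomes a single Kafka message and the consumer splits it back into rows. I also understand the Kafka producer batches `send()` calls internally via `batch.size` and `linger.ms`, so 100 `send()` calls need not mean 100 network round trips, but explicit grouping like this would also shrink per-message overhead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: groups a chunk of rows into a few larger payloads,
// each of which would be published as ONE Kafka message instead of one per row.
public class ChunkBatcher {

    public static List<String> toPayloads(List<String> rows, int rowsPerMessage) {
        List<String> payloads = new ArrayList<>();
        for (int i = 0; i < rows.size(); i += rowsPerMessage) {
            int end = Math.min(i + rowsPerMessage, rows.size());
            // newline-delimited rows; the consumer splits the payload back into rows
            payloads.add(String.join("\n", rows.subList(i, end)));
        }
        return payloads;
    }
}
```

For example, a 100-row chunk with `rowsPerMessage = 25` would become 4 messages instead of 100.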