I currently have a Spring Batch job with a single step that reads data from Oracle, passes it through multiple Spring Batch processors (CompositeItemProcessor), and writes it to different destinations such as Oracle and files (CompositeItemWriter):
<batch:step id="dataTransformationJob">
    <batch:tasklet transaction-manager="transactionManager" task-executor="taskExecutor" throttle-limit="30">
        <batch:chunk reader="dataReader" processor="compositeDataProcessor" writer="compositeItemWriter" commit-interval="100"/>
    </batch:tasklet>
</batch:step>
In the above step, the compositeItemWriter is configured with two writers that run one after the other and write 100 million records to both Oracle and a file. The dataReader has a synchronized read method to ensure that multiple threads don't read the same rows from Oracle. The job currently takes about 1 hour 30 minutes to complete.
I am planning to break the above job into two parts: the reader/processors will produce data on two Kafka topics (one for data destined for Oracle, the other for data destined for a file). On the other side, I will have a job with two parallel flows that read from each topic and write the data to Oracle and a file respectively.
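To show what I mean on the producer side: I would keep the existing CompositeItemWriter but swap its delegates for two KafkaItemWriters (available since Spring Batch 4.2), one per topic. The bean names, topic wiring, and the `Record` type below are my own placeholders; KafkaItemWriter publishes via `sendDefault`, so each KafkaTemplate would need its default topic set.

```java
import java.util.Arrays;

import org.springframework.batch.item.kafka.KafkaItemWriter;
import org.springframework.batch.item.kafka.builder.KafkaItemWriterBuilder;
import org.springframework.batch.item.support.CompositeItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.kafka.core.KafkaTemplate;

public class ProducerSideConfig {

    // Sketch only: replaces the current Oracle/file delegates with two Kafka writers.
    @Bean
    public CompositeItemWriter<Record> kafkaCompositeWriter(
            KafkaTemplate<String, Record> oracleTopicTemplate, // defaultTopic = the "Oracle" topic
            KafkaTemplate<String, Record> fileTopicTemplate) { // defaultTopic = the "file" topic

        KafkaItemWriter<String, Record> oracleTopicWriter = new KafkaItemWriterBuilder<String, Record>()
                .kafkaTemplate(oracleTopicTemplate)
                .itemKeyMapper(Record::getId) // message key; also determines the target partition
                .build();

        KafkaItemWriter<String, Record> fileTopicWriter = new KafkaItemWriterBuilder<String, Record>()
                .kafkaTemplate(fileTopicTemplate)
                .itemKeyMapper(Record::getId)
                .build();

        CompositeItemWriter<Record> composite = new CompositeItemWriter<>();
        composite.setDelegates(Arrays.asList(oracleTopicWriter, fileTopicWriter));
        return composite;
    }
}
```

This is a configuration sketch, not a tested implementation.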
With the above architecture in mind, I wanted to understand how to refactor a Spring Batch job to use Kafka. I believe these are the areas I would need to address:
- In the existing job that doesn't use Kafka, my throttle limit is 30; however, when I use Kafka in the middle, how does one decide the right throttle-limit?
- In the existing job I have a commit-interval of 100. This means that the CompositeItemWriter will be called once for every 100 records, and each delegate writer will unpack the chunk and call its write method on it. Does this mean that when I write to Kafka, there will be 100 publish calls to Kafka?
- Is there a way to club multiple rows into one single message in Kafka to avoid multiple network calls?
- On the consumer side, I want a Spring Batch multi-threaded step that can read each partition of a topic in parallel. Does Spring Batch have built-in classes to support this already?
- The consumer will use a standard JdbcBatchItemWriter or FlatFileItemWriter to write the data read from Kafka, so I believe this part should just be standard Spring Batch.
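On the consumer-side question, my research so far suggests Spring Batch 4.2+ ships `org.springframework.batch.item.kafka.KafkaItemReader`, and that one way to read partitions in parallel is a partitioned step that launches one step-scoped reader per partition. Below is how I would expect to configure it; the topic name, group id, bootstrap address, and the `Record`/`RecordDeserializer` types are placeholders, so please correct me if this is not the intended usage.

```java
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.kafka.KafkaItemReader;
import org.springframework.batch.item.kafka.builder.KafkaItemReaderBuilder;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;

public class ConsumerSideConfig {

    // Sketch only: one reader instance per topic partition, driven by a partitioned step
    // that puts the partition number into each step's ExecutionContext.
    @Bean
    @StepScope
    public KafkaItemReader<String, Record> kafkaReader(
            @Value("#{stepExecutionContext['partition']}") Integer partition) {

        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "oracle-writer-job");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, RecordDeserializer.class.getName());

        return new KafkaItemReaderBuilder<String, Record>()
                .name("kafkaReader." + partition)
                .topic("oracle-topic")
                .partitions(partition)           // this reader instance owns one partition
                .consumerProperties(props)
                .pollTimeout(Duration.ofSeconds(10))
                .saveState(true)                 // offsets kept in the ExecutionContext for restart
                .build();
    }
}
```

Again, this is a configuration sketch based on my reading of the docs, not tested code.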
Note: I am aware of Kafka Connect but don't want to use it because it requires setting up a Connect cluster, and I don't have the infrastructure available to support that.
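To make the "club multiple rows into one message" idea concrete, this is the kind of aggregation I have in mind before publishing (plain Java, all names mine): group each 100-record chunk into a handful of newline-delimited payloads, so each payload becomes a single Kafka message and the consumer splits it back into rows. I also understand the Kafka producer batches `send()` calls internally via `batch.size` and `linger.ms`, so 100 `send()` calls need not mean 100 network round trips, but explicit grouping like this would also shrink per-message overhead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: groups a chunk of rows into a few larger payloads,
// each of which would be published as ONE Kafka message instead of one per row.
public class ChunkBatcher {

    public static List<String> toPayloads(List<String> rows, int rowsPerMessage) {
        List<String> payloads = new ArrayList<>();
        for (int i = 0; i < rows.size(); i += rowsPerMessage) {
            int end = Math.min(i + rowsPerMessage, rows.size());
            // newline-delimited rows; the consumer splits the payload back into rows
            payloads.add(String.join("\n", rows.subList(i, end)));
        }
        return payloads;
    }
}
```

For example, a 100-row chunk with `rowsPerMessage = 25` would become 4 messages instead of 100.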