0 votes

I am in the process of implementing a Spring Batch job for our file upload process. My requirement is to read a flat file, apply business logic, store the result in the database, and then post a Kafka message.

I have a single chunk-based step that uses a custom reader, processor, and writer. The process works correctly but takes a long time on a big file.

It currently takes 15 minutes to process a file with 60K records. I need to reduce that to less than 5 minutes, as we will be consuming much bigger files than this.

As per https://docs.spring.io/spring-batch/docs/current/reference/html/scalability.html, I understand that making the step multi-threaded would give a performance boost, at the cost of restartability. However, I am using FlatFileItemReader, ItemProcessor, and ItemWriter, and none of them is thread-safe.
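For reference, here is roughly the multi-threaded configuration I have been considering (the bean names, chunk size, task executor, and throttle limit are placeholders, not my actual code). It wraps the FlatFileItemReader in a SynchronizedItemStreamReader so that several chunk-processing threads can share it; as I understand it, the reader's state saving would also have to be turned off, which is where restartability is lost:

    import org.springframework.batch.core.Step;
    import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
    import org.springframework.batch.item.ItemProcessor;
    import org.springframework.batch.item.ItemWriter;
    import org.springframework.batch.item.file.FlatFileItemReader;
    import org.springframework.batch.item.support.SynchronizedItemStreamReader;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.core.task.SimpleAsyncTaskExecutor;

    @Configuration
    public class FileUploadStepConfig {

        @Bean
        public SynchronizedItemStreamReader<Message> synchronizedReader(FlatFileItemReader<Message> flatFileItemReader) {
            // Wraps the non-thread-safe FlatFileItemReader so several chunk threads can share it
            SynchronizedItemStreamReader<Message> reader = new SynchronizedItemStreamReader<>();
            reader.setDelegate(flatFileItemReader);
            return reader;
        }

        @Bean
        public Step fileUploadStep(StepBuilderFactory steps,
                                   SynchronizedItemStreamReader<Message> reader,
                                   ItemProcessor<Message, Message> processor,
                                   ItemWriter<Message> writer) {
            return steps.get("fileUploadStep")
                    .<Message, Message>chunk(500)
                    .reader(reader)
                    .processor(processor)
                    .writer(writer)
                    // placeholder executor and concurrency limit; would need tuning against the DB and Kafka
                    .taskExecutor(new SimpleAsyncTaskExecutor("upload-"))
                    .throttleLimit(8)
                    .build();
        }
    }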

Any suggestions as to how to improve performance here?

Here is the writer code:

    public void write(List<? extends Message> items) {
        items.forEach(this::process);
    }

    private void process(Message message) {
        if (message == null)
            return;
        try {
            // message is a DTO that has info about success or failure
            if (success) {
                // post Kafka message using Spring Cloud Stream
                // insert record in DB using Spring Data jpaRepository
            } else {
                // insert record in DB using Spring Data jpaRepository
            }
        } catch (Exception e) {
            // throw exception
        }
    }

Best regards, Preeti

Before going to multi-threading or partitioning, have you profiled your current job? What is the value of the chunk size? Low values mean a lot of transactions, which could be a performance issue. What is the bottleneck of your job: your processing logic or the I/O (read/write operations)? Those questions are really important to see if you really need to scale your job and, if yes, which scaling strategy to implement. – Mahmoud Ben Hassine
Thanks @MahmoudBenHassine for getting back. I have defined the chunk size as 500. I did try to log time metrics around the reader, processor, and writer; the writer was the one taking most of the time. Here are the Micrometer stats generated by Spring Batch: Write (spring.batch.chunk.write) TOTAL_TIME: 766.972706343, Process (spring.batch.item.process) TOTAL_TIME: 3.238209216, Read (spring.batch.item.read) TOTAL_TIME: 4.164657738. – Preeti
Thank you for the updates. Can you share your writer config? Also, which job repository do you use? The default map-based job repository is probably slowing things down. – Mahmoud Ben Hassine
Thank you. I am using the default MapJobRegistry. The writer implements ItemWriter<?>. I have updated my original post with the writer's logic. – Preeti
The map-based job repository can be slow and is deprecated (github.com/spring-projects/spring-batch/issues/3780); I recommend using the JDBC-based job repository. Moreover, your writer does not seem to use bulk updates: you are issuing a save operation for each item in a loop. You should do something like saveAll(items) to save all items at once in a single bulk operation. We introduced similar improvements in 4.3 (docs.spring.io/spring-batch/docs/4.3.x/reference/html/…), which you can use for inspiration. – Mahmoud Ben Hassine
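For illustration, a minimal sketch of the bulk-write approach suggested in the last comment, assuming the Message DTO is itself the JPA entity and that MessageRepository, MessagePublisher, and the isSuccess flag exist (they are placeholders, not taken from the original post):

    import java.util.List;
    import org.springframework.batch.item.ItemWriter;

    public class MessageItemWriter implements ItemWriter<Message> {

        private final MessageRepository messageRepository;   // hypothetical Spring Data JPA repository
        private final MessagePublisher messagePublisher;     // hypothetical wrapper around Spring Cloud Stream

        public MessageItemWriter(MessageRepository messageRepository, MessagePublisher messagePublisher) {
            this.messageRepository = messageRepository;
            this.messagePublisher = messagePublisher;
        }

        @Override
        public void write(List<? extends Message> items) {
            // Persist the whole chunk in one repository call instead of one save() per item
            messageRepository.saveAll(items);

            // Publish Kafka events only for the items flagged as successful
            items.stream()
                 .filter(Message::isSuccess)                 // hypothetical success flag on the DTO
                 .forEach(messagePublisher::publish);
        }
    }

Note that with JPA, saveAll alone may still issue one INSERT per row unless JDBC batching is enabled on the persistence provider (for Hibernate, the hibernate.jdbc.batch_size property). Switching from the map-based to the JDBC job repository is a separate change, made through the job repository configuration or, on Spring Boot, by providing a DataSource so that the Batch auto-configuration picks it up.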

1 Answer

0 votes