0 votes

I am in the process of implementing a Spring Batch job for our file upload process. My requirement is to read a flat file, apply business logic, store the result in the database, and then post a Kafka message.

I have a single chunk-based step that uses a custom reader, processor, and writer. The process works correctly but takes a long time on a big file.

It currently takes 15 minutes to process a file with 60K records. I need to reduce that to less than 5 minutes, as we will be consuming much bigger files than this.

As per https://docs.spring.io/spring-batch/docs/current/reference/html/scalability.html, I understand that making the step multi-threaded would give a performance boost, at the cost of restartability. However, I am using FlatFileItemReader, ItemProcessor, and ItemWriter, and none of them is thread-safe.
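For reference, here is roughly the multi-threaded configuration I have been considering (the bean names, chunk size, task executor, and throttle limit are placeholders, not my actual code). It wraps the FlatFileItemReader in a SynchronizedItemStreamReader so that several chunk-processing threads can share it; as I understand it, the reader's state saving would also have to be turned off, which is where restartability is lost:

    import org.springframework.batch.core.Step;
    import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
    import org.springframework.batch.item.ItemProcessor;
    import org.springframework.batch.item.ItemWriter;
    import org.springframework.batch.item.file.FlatFileItemReader;
    import org.springframework.batch.item.support.SynchronizedItemStreamReader;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.core.task.SimpleAsyncTaskExecutor;

    @Configuration
    public class FileUploadStepConfig {

        @Bean
        public SynchronizedItemStreamReader<Message> synchronizedReader(FlatFileItemReader<Message> flatFileItemReader) {
            // Wraps the non-thread-safe FlatFileItemReader so several chunk threads can share it
            SynchronizedItemStreamReader<Message> reader = new SynchronizedItemStreamReader<>();
            reader.setDelegate(flatFileItemReader);
            return reader;
        }

        @Bean
        public Step fileUploadStep(StepBuilderFactory steps,
                                   SynchronizedItemStreamReader<Message> reader,
                                   ItemProcessor<Message, Message> processor,
                                   ItemWriter<Message> writer) {
            return steps.get("fileUploadStep")
                    .<Message, Message>chunk(500)
                    .reader(reader)
                    .processor(processor)
                    .writer(writer)
                    // placeholder executor and concurrency limit; would need tuning against the DB and Kafka
                    .taskExecutor(new SimpleAsyncTaskExecutor("upload-"))
                    .throttleLimit(8)
                    .build();
        }
    }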

Any suggestions as to how to improve performance here?

Here is the writer code:

    public void write(List<? extends Message> items) {
        items.forEach(this::process);
    }

    private void process(Message message) {
        if (message == null)
            return;
        try {
            // message is a DTO that has info about success or failure
            if (success) {
                // post Kafka message using Spring Cloud Stream
                // insert record in DB using Spring Data jpaRepository
            } else {
                // insert record in DB using Spring Data jpaRepository
            }
        } catch (Exception e) {
            // throw exception
        }
    }

Best regards, Preeti

Before going to multi-threading or partitioning, have you profiled your current job? What is the value of the chunk size? Low values mean a lot of transactions, which could be a performance issue. What is the bottleneck of your job: your processing logic or the I/O (read/write operations)? Those questions are really important to see if you really need to scale your job and, if yes, which scaling strategy to implement. – Mahmoud Ben Hassine
Thanks @MahmoudBenHassine for getting back. I have defined the chunk size as 500. I did try to log time metrics around the reader, processor, and writer; the writer was the one taking most of the time. Here are the Micrometer stats generated by Spring Batch: Write (spring.batch.chunk.write) TOTAL_TIME: 766.972706343, Process (spring.batch.item.process) TOTAL_TIME: 3.238209216, Read (spring.batch.item.read) TOTAL_TIME: 4.164657738. – Preeti
Thank you for the updates. Can you share your writer config? Also, which job repository do you use? The default map-based job repository is probably slowing things down. – Mahmoud Ben Hassine
Thank you. I am using the default MapJobRegistry. The writer implements ItemWriter<?>. I have updated my original post with the writer's logic. – Preeti
The map-based job repository can be slow and is deprecated (github.com/spring-projects/spring-batch/issues/3780); I recommend using the JDBC-based job repository. Moreover, your writer does not seem to use bulk updates: you are issuing a save operation for each item in a loop. You should do something like saveAll(items) to save all items at once in a single bulk operation. We introduced similar improvements in 4.3 (docs.spring.io/spring-batch/docs/4.3.x/reference/html/…), which you can use for inspiration. – Mahmoud Ben Hassine
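For illustration, a minimal sketch of the bulk-write approach suggested in the last comment, assuming the Message DTO is itself the JPA entity and that MessageRepository, MessagePublisher, and the isSuccess flag exist (they are placeholders, not taken from the original post):

    import java.util.List;
    import org.springframework.batch.item.ItemWriter;

    public class MessageItemWriter implements ItemWriter<Message> {

        private final MessageRepository messageRepository;   // hypothetical Spring Data JPA repository
        private final MessagePublisher messagePublisher;     // hypothetical wrapper around Spring Cloud Stream

        public MessageItemWriter(MessageRepository messageRepository, MessagePublisher messagePublisher) {
            this.messageRepository = messageRepository;
            this.messagePublisher = messagePublisher;
        }

        @Override
        public void write(List<? extends Message> items) {
            // Persist the whole chunk in one repository call instead of one save() per item
            messageRepository.saveAll(items);

            // Publish Kafka events only for the items flagged as successful
            items.stream()
                 .filter(Message::isSuccess)                 // hypothetical success flag on the DTO
                 .forEach(messagePublisher::publish);
        }
    }

Note that with JPA, saveAll alone may still issue one INSERT per row unless JDBC batching is enabled on the persistence provider (for Hibernate, the hibernate.jdbc.batch_size property). Switching from the map-based to the JDBC job repository is a separate change, made through the job repository configuration or, on Spring Boot, by providing a DataSource so that the Batch auto-configuration picks it up.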

1 Answer

0 votes