1
votes

I have a business case to merge multiple CSV files (1000+ files, each containing about 1000 records) into a single CSV file using Spring Batch.

Please share your guidance on the approach, and on performance as well.

So far, I have tried two approaches:

Approach 1: A chunk-oriented tasklet with MultiResourceItemReader to read the files from a directory and FlatFileItemWriter as the item writer.

The issue here is that processing is very slow because it is single-threaded, although the approach works as expected.

Approach 2: Using MultiResourcePartitioner as the partitioner and an AsyncTaskExecutor as the task executor.

The issue here is that, because the step is asynchronous and multi-threaded, data gets overwritten or corrupted while merging into the final single file.

2
You need to show us what you have tried so far, or what you think would be a better approach based on your knowledge of the Spring Batch framework. This will help you get better answers. - Sabir Khan
Updated with the solutions I have tried. @SabirKhan - Sada Shiv Dash
Are you doing any processing on the source CSV records (like filtering etc.), or is it a simple file merge with all headers being common? - Sabir Khan
@SabirKhan - No filtering, it is a simple merge of the files into one file, with all headers common. - Sada Shiv Dash
Since there is no filtering/processing and all files have the same structure, then Approach 1 should be ok (even if single threaded). What do you mean by slow, can you give some numbers? Have you tried different values for the commit-interval? That said, do you really need Spring Batch for such a simple task? Something like cat *.csv >> all.csv or equivalent should do the trick (and should be faster). - Mahmoud Ben Hassine

2 Answers

0
votes

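You can wrap your FlatFileItemWriter in an AsyncItemWriter and use it together with an AsyncItemProcessor. This will not corrupt your data: the processing runs on several threads, while the AsyncItemWriter unwraps the resulting Futures and hands the items to the single delegate writer. Any performance gain therefore comes from the processing side.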

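import org.springframework.batch.integration.async.AsyncItemProcessor;
import org.springframework.batch.integration.async.AsyncItemWriter;
import org.springframework.batch.item.file.FlatFileItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.core.task.TaskExecutor;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

// Customer, itemProcessor() and the FlatFileItemWriter bean are not shown here.

@Bean
public AsyncItemWriter<Customer> asyncItemWriter(FlatFileItemWriter<Customer> flatFileItemWriter) throws Exception {
    AsyncItemWriter<Customer> asyncItemWriter = new AsyncItemWriter<>();

    // The delegate does the actual file writing once each Future has completed.
    asyncItemWriter.setDelegate(flatFileItemWriter);
    asyncItemWriter.afterPropertiesSet();

    return asyncItemWriter;
}

@Bean
public AsyncItemProcessor<Customer, Customer> asyncItemProcessor() throws Exception {
    AsyncItemProcessor<Customer, Customer> asyncItemProcessor = new AsyncItemProcessor<>();

    // Each item is processed by the delegate on a thread from the task executor.
    asyncItemProcessor.setDelegate(itemProcessor());
    asyncItemProcessor.setTaskExecutor(threadPoolTaskExecutor());
    asyncItemProcessor.afterPropertiesSet();

    return asyncItemProcessor;
}

@Bean
public TaskExecutor threadPoolTaskExecutor() {
    ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
    executor.setCorePoolSize(10);
    executor.setMaxPoolSize(10);
    executor.setThreadNamePrefix("default_task_executor_thread");
    executor.initialize();
    return executor;
}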
0
votes

Since the headers are the same in your source and destination files, I wouldn't recommend using the Spring Batch provided readers to convert lines into specific beans. Column-level information is not needed, and since CSV is a text format, you can work with line-level information only, without breaking lines into fields.

Also, partitioning per file is going to be very slow if you have that many files. You should first fix the number of partitions (say 10 or 20) and then group your files into that many partitions. Secondly, since file writing is a disk-bound operation rather than a CPU-bound one, multi-threading won't be useful.
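As a rough illustration of grouping files into a fixed number of partitions, here is a minimal plain-Java sketch. The class name FileGrouper and the round-robin distribution are my own assumptions, not part of Spring Batch; in a real job each group would presumably be handed to one partition by a custom partitioner.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: not a Spring Batch class.
public class FileGrouper {

    /** Distributes the given items round-robin into at most partitionCount groups. */
    public static <T> List<List<T>> group(List<T> files, int partitionCount) {
        if (partitionCount < 1) {
            throw new IllegalArgumentException("partitionCount must be >= 1");
        }
        // Never create more groups than there are files, so no group is empty.
        int groups = Math.min(partitionCount, files.size());
        List<List<T>> result = new ArrayList<>();
        for (int i = 0; i < groups; i++) {
            result.add(new ArrayList<>());
        }
        for (int i = 0; i < files.size(); i++) {
            result.get(i % groups).add(files.get(i));
        }
        return result;
    }
}
```

With 1000 files and partitionCount set to 10, each group would hold 100 files, so the job manages 10 partitions instead of 1000.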

What I suggest instead is to write a custom reader and writer in plain Java, along the lines suggested in this answer, where your reader returns a List<String> and your writer receives a List<List<String>> that you can write to the file.

If you have enough memory to hold the lines from all files at once, you can read all files in one go and keep returning chunk_size lines at a time; otherwise, reading a small set of files at a time until the chunk size limit is reached should be good enough. Your reader returns null when there are no more files to read.
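To make the line-level idea concrete, here is a minimal sketch in plain Java with no Spring Batch types. All names in it (LineLevelMerge, FileLinesReader, LinesWriter, merge) are hypothetical, and it assumes that every source file starts with exactly one header line and that a whole file fits in memory. The reader returns one file's data lines per call and null when no files are left; the writer receives a chunk as a List<List<String>>.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

public class LineLevelMerge {

    /** Returns the data lines of one file per call, or null when no files are left. */
    public static class FileLinesReader {
        private final Iterator<Path> files;

        public FileLinesReader(List<Path> files) {
            this.files = files.iterator();
        }

        public List<String> read() throws IOException {
            if (!files.hasNext()) {
                return null;
            }
            List<String> lines = Files.readAllLines(files.next(), StandardCharsets.UTF_8);
            // Assumption: the first line is the common header, so it is skipped.
            return lines.isEmpty() ? lines : lines.subList(1, lines.size());
        }
    }

    /** Writes the header once, then appends every chunk it receives. */
    public static class LinesWriter {
        private final Path target;

        public LinesWriter(Path target, String header) throws IOException {
            this.target = target;
            // Creates the target file (or truncates an existing one) with the header.
            Files.write(target, Collections.singletonList(header), StandardCharsets.UTF_8);
        }

        public void write(List<List<String>> chunk) throws IOException {
            try (BufferedWriter out = Files.newBufferedWriter(
                    target, StandardCharsets.UTF_8, StandardOpenOption.APPEND)) {
                for (List<String> fileLines : chunk) {
                    for (String line : fileLines) {
                        out.write(line);
                        out.newLine();
                    }
                }
            }
        }
    }

    /** Single-threaded read/write loop; chunkSize is the number of files per write. */
    public static void merge(List<Path> sources, Path target, String header, int chunkSize)
            throws IOException {
        FileLinesReader reader = new FileLinesReader(sources);
        LinesWriter writer = new LinesWriter(target, header);
        List<List<String>> chunk = new ArrayList<>();
        List<String> item;
        while ((item = reader.read()) != null) {
            chunk.add(item);
            if (chunk.size() == chunkSize) {
                writer.write(chunk);
                chunk = new ArrayList<>();
            }
        }
        if (!chunk.isEmpty()) {
            writer.write(chunk);
        }
    }
}
```

In a Spring Batch job, the two nested classes would roughly correspond to a custom ItemReader and ItemWriter, with the loop in merge replaced by the chunk-oriented step.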