
What is the best-performing architecture for reading XML in Spring Batch? Each XML file is approximately 300 KB, and we are processing 1 million of them.

Our Current Approach

  1. 30 partitions with a grid size of 30; each slave gets ~166 XML files

  2. Commit interval (chunk size) of 100

  3. The application starts with 8 GB of memory

  4. JAXB unmarshalling in the reader (default bean scope)

@Bean
@StepScope
@Qualifier("xmlItemReader")
public IteratorItemReader<BaseDTO> xmlItemReader(
        @Value("#{stepExecutionContext['fileName']}") List<String> fileNameList) throws Exception {
    String readingFile = "File Not Found";
    logger.info("----xmlItemReader----fileNameList--->" + fileNameList);
    // Eagerly unmarshals every file of the partition into memory before the step runs.
    List<BaseDTO> fileList = new ArrayList<>();
    for (String filePath : fileNameList) {
        try {
            readingFile = filePath.trim();
            Invoice bill = (Invoice) getUnMarshaller().unmarshal(new File(readingFile));
            UnifiedInvoiceDTO unifiedDTO = new UnifiedInvoiceDTO(bill, environment);
            unifiedDTO.setFileName(filePath);
            BaseDTO baseDTO = new BaseDTO();
            baseDTO.setUnifiedDTO(unifiedDTO);
            fileList.add(baseDTO);
        } catch (Exception e) {
            // Record the failure as an item so it can be reported downstream.
            UnifiedInvoiceDTO unifiedDTO = new UnifiedInvoiceDTO();
            unifiedDTO.setFileName(readingFile);
            unifiedDTO.setErrorMessage(e);
            BaseDTO baseDTO = new BaseDTO();
            baseDTO.setUnifiedDTO(unifiedDTO);
            fileList.add(baseDTO);
        }
    }
    return new IteratorItemReader<>(fileList);
}

Our questions:

  1. Is this architecture correct?
  2. Is there any performance or architectural advantage to using StaxEventItemReader with XStreamMarshaller over JAXB? (A configuration sketch follows this list.)
  3. How can we manage memory properly to avoid slowdowns?
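
For reference on question 2: StaxEventItemReader streams XML fragments one at a time instead of unmarshalling whole documents up front, which also bears on question 3. A minimal sketch, assuming Spring Batch 4's StaxEventItemReaderBuilder and an <invoice> fragment root element (the root element name and method name are illustrative assumptions; Invoice is the class from the code above):

import org.springframework.batch.item.xml.StaxEventItemReader;
import org.springframework.batch.item.xml.builder.StaxEventItemReaderBuilder;
import org.springframework.core.io.FileSystemResource;
import org.springframework.oxm.jaxb.Jaxb2Marshaller;

public StaxEventItemReader<Invoice> staxInvoiceReader(String filePath) {
    Jaxb2Marshaller marshaller = new Jaxb2Marshaller();
    marshaller.setClassesToBeBound(Invoice.class);

    // Pulls one <invoice> fragment at a time instead of loading the
    // whole document (or the whole partition) into memory.
    return new StaxEventItemReaderBuilder<Invoice>()
            .name("staxInvoiceReader")
            .resource(new FileSystemResource(filePath))
            .addFragmentRootElements("invoice") // assumed root element name
            .unmarshaller(marshaller)
            .build();
}
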
Have you considered creating a job per file? That's the best option IMO in terms of restartability, performance, scalability, and all the good reasons for making one thing do one thing and do it well. - Mahmoud Ben Hassine
We are receiving the XML file paths in a *.txt file. Each txt file has on average 5,000 (max 10,000) XML file paths. Each txt file creates one job with 30 slaves.
SELECT * FROM BATCH_JOB_EXECUTION WHERE JOB_EXECUTION_ID=13492; -- per txt file
SELECT count(STEP_EXECUTION_ID) FROM BATCH_STEP_EXECUTION WHERE JOB_EXECUTION_ID=13492 AND STEP_NAME NOT IN ('masterStep','moveFiles'); -- 31 slaves
- Rakesh
Could you advise: are you suggesting creating a job per *.txt file (multiple XML file paths in one txt file), with each slave partition handling one XML? In that case BATCH_STEP_EXECUTION would hold a very large number of records, as we have 5 million XMLs. - Rakesh
I added an answer with more details. Please accept it if it helps: stackoverflow.com/help/someone-answers. - Mahmoud Ben Hassine

1 Answer


I would create a job per XML file, using the file name as a job parameter (see the launch sketch after this list). This approach has many benefits:

  • Restartability: If a job fails, you only restart the failed file (from where it left off)
  • Scalability: This approach allows you to run multiple jobs in parallel. If a single machine is not enough, you can distribute the load on multiple machines
  • Logging: Logs are separated by design; you don't need an MDC or any other technique to separate logs
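
A minimal sketch of that launch call, with the file name as an identifying job parameter (the JobLauncher/Job wiring and all names here are illustrative assumptions, not code from the question):

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;

public void launchJobForFile(JobLauncher jobLauncher, Job invoiceJob, String filePath)
        throws Exception {
    // One identifying parameter per file means one job instance per file,
    // which is what makes per-file restartability possible.
    JobParameters parameters = new JobParametersBuilder()
            .addString("fileName", filePath)
            .toJobParameters();
    jobLauncher.run(invoiceJob, parameters);
}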

We are receiving the XML file paths in a *.txt file

You can create a script that iterates over these lines and launches a job per line (i.e. per file). GNU Parallel (or a similar tool) is a good option for launching jobs in parallel.
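
As a hypothetical Java equivalent of such a script (the file name invoice-files.txt and the pool size are illustrative assumptions), the loop could reuse the launchJobForFile sketch above:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.springframework.batch.core.Job;
import org.springframework.batch.core.launch.JobLauncher;

public void launchAllJobs(JobLauncher jobLauncher, Job invoiceJob) throws Exception {
    // Each line of the txt file is one XML file path, i.e. one job.
    List<String> filePaths = Files.readAllLines(Paths.get("invoice-files.txt"));
    ExecutorService pool = Executors.newFixedThreadPool(8); // degree of parallelism
    for (String filePath : filePaths) {
        pool.submit(() -> {
            try {
                launchJobForFile(jobLauncher, invoiceJob, filePath.trim());
            } catch (Exception e) {
                // Log and continue; the remaining files are independent jobs.
            }
        });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.DAYS); // wait for all jobs to finish
}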