
What is the best-performing architecture for reading XML in Spring Batch? Each XML file is approximately 300 KB, and we are processing 1 million of them.

Our Current Approach

  1. 30 partitions with a grid size of 30; each slave gets ~166 XML files

  2. Commit interval (chunk size) of 100

  3. The application starts with 8 GB of memory

  4. JAXB unmarshalling in the reader (default bean scope)

@Bean
@StepScope
@Qualifier("xmlItemReader")
public IteratorItemReader<BaseDTO> xmlItemReader(
        @Value("#{stepExecutionContext['fileName']}") List<String> fileNameList) throws Exception {
    String readingFile = "File Not Found";
    logger.info("----xmlItemReader----fileNameList--->" + fileNameList);
    // Eagerly unmarshals every file of the partition into memory before the step runs.
    List<BaseDTO> fileList = new ArrayList<>();
    for (String filePath : fileNameList) {
        try {
            readingFile = filePath.trim();
            Invoice bill = (Invoice) getUnMarshaller().unmarshal(new File(readingFile));
            UnifiedInvoiceDTO unifiedDTO = new UnifiedInvoiceDTO(bill, environment);
            unifiedDTO.setFileName(filePath);
            BaseDTO baseDTO = new BaseDTO();
            baseDTO.setUnifiedDTO(unifiedDTO);
            fileList.add(baseDTO);
        } catch (Exception e) {
            // Record the failure as an item so it can be reported downstream.
            UnifiedInvoiceDTO unifiedDTO = new UnifiedInvoiceDTO();
            unifiedDTO.setFileName(readingFile);
            unifiedDTO.setErrorMessage(e);
            BaseDTO baseDTO = new BaseDTO();
            baseDTO.setUnifiedDTO(unifiedDTO);
            fileList.add(baseDTO);
        }
    }
    return new IteratorItemReader<>(fileList);
}

Our questions:

  1. Is this architecture correct?
  2. Is there any performance or architectural advantage to using StaxEventItemReader with XStreamMarshaller over JAXB? (A configuration sketch follows this list.)
  3. How can we manage memory properly to avoid slowdowns?
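
For reference on question 2: StaxEventItemReader streams XML fragments one at a time instead of unmarshalling whole documents up front, which also bears on question 3. A minimal sketch, assuming Spring Batch 4's StaxEventItemReaderBuilder and an <invoice> fragment root element (the root element name and method name are illustrative assumptions; Invoice is the class from the code above):

import org.springframework.batch.item.xml.StaxEventItemReader;
import org.springframework.batch.item.xml.builder.StaxEventItemReaderBuilder;
import org.springframework.core.io.FileSystemResource;
import org.springframework.oxm.jaxb.Jaxb2Marshaller;

public StaxEventItemReader<Invoice> staxInvoiceReader(String filePath) {
    Jaxb2Marshaller marshaller = new Jaxb2Marshaller();
    marshaller.setClassesToBeBound(Invoice.class);

    // Pulls one <invoice> fragment at a time instead of loading the
    // whole document (or the whole partition) into memory.
    return new StaxEventItemReaderBuilder<Invoice>()
            .name("staxInvoiceReader")
            .resource(new FileSystemResource(filePath))
            .addFragmentRootElements("invoice") // assumed root element name
            .unmarshaller(marshaller)
            .build();
}
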
Have you considered creating a job per file? That's the best option IMO in terms of restartability, performance, scalability, and all the good reasons for making one thing do one thing and do it well. - Mahmoud Ben Hassine
We are receiving the XML file paths in a *.txt file. Each txt file has on average 5,000 (max 10,000) XML file paths. Each txt file creates one job with 30 slaves.
SELECT * FROM BATCH_JOB_EXECUTION WHERE JOB_EXECUTION_ID=13492; -- per txt file
SELECT count(STEP_EXECUTION_ID) FROM BATCH_STEP_EXECUTION WHERE JOB_EXECUTION_ID=13492 AND STEP_NAME NOT IN ('masterStep','moveFiles'); -- 31 slaves
- Rakesh
Could you advise: are you suggesting creating a job per *.txt file (multiple XML file paths in one txt file), with each slave partition handling one XML? In that case BATCH_STEP_EXECUTION would hold a very large number of records, as we have 5 million XMLs. - Rakesh
I added an answer with more details. Please accept it if it helps: stackoverflow.com/help/someone-answers. - Mahmoud Ben Hassine

1 Answer


I would create a job per XML file, using the file name as a job parameter (see the launch sketch after this list). This approach has many benefits:

  • Restartability: If a job fails, you only restart the failed file (from where it left off)
  • Scalability: This approach allows you to run multiple jobs in parallel. If a single machine is not enough, you can distribute the load on multiple machines
  • Logging: Logs are separated by design; you don't need an MDC or any other technique to separate logs
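
A minimal sketch of that launch call, with the file name as an identifying job parameter (the JobLauncher/Job wiring and all names here are illustrative assumptions, not code from the question):

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;

public void launchJobForFile(JobLauncher jobLauncher, Job invoiceJob, String filePath)
        throws Exception {
    // One identifying parameter per file means one job instance per file,
    // which is what makes per-file restartability possible.
    JobParameters parameters = new JobParametersBuilder()
            .addString("fileName", filePath)
            .toJobParameters();
    jobLauncher.run(invoiceJob, parameters);
}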

We are receiving the XML file paths in a *.txt file

You can create a script that iterates over these lines and launches a job per line (i.e. per file). GNU Parallel (or a similar tool) is a good option for launching jobs in parallel.
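
As a hypothetical Java equivalent of such a script (the file name invoice-files.txt and the pool size are illustrative assumptions), the loop could reuse the launchJobForFile sketch above:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.springframework.batch.core.Job;
import org.springframework.batch.core.launch.JobLauncher;

public void launchAllJobs(JobLauncher jobLauncher, Job invoiceJob) throws Exception {
    // Each line of the txt file is one XML file path, i.e. one job.
    List<String> filePaths = Files.readAllLines(Paths.get("invoice-files.txt"));
    ExecutorService pool = Executors.newFixedThreadPool(8); // degree of parallelism
    for (String filePath : filePaths) {
        pool.submit(() -> {
            try {
                launchJobForFile(jobLauncher, invoiceJob, filePath.trim());
            } catch (Exception e) {
                // Log and continue; the remaining files are independent jobs.
            }
        });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.DAYS); // wait for all jobs to finish
}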