I have a batch job that reads records from SQL Server and writes them into MariaDB. Even though I have implemented partitioning in the batch process, the process is very slow.

Below is the DataSource configuration for the source and target systems.

@Bean(name = "sourceSqlServerDataSource")
    public DataSource mysqlDataSource() {
        HikariDataSource hikariDataSource = new HikariDataSource();
        hikariDataSource.setMaximumPoolSize(100);
        hikariDataSource.setUsername(username);
        hikariDataSource.setPassword(password);
        hikariDataSource.setJdbcUrl(jdbcUrl);
        hikariDataSource.setDriverClassName(driverClassName);
        hikariDataSource.setPoolName("Source-SQL-Server");
        return hikariDataSource;
    } 

    @Bean(name = "targetMySqlDataSource")
    @Primary
    public DataSource mysqlDataSource() {
        HikariDataSource hikariDataSource = new HikariDataSource();
        hikariDataSource.setMaximumPoolSize(100);
        hikariDataSource.setUsername(username);
        hikariDataSource.setPassword(password);
        hikariDataSource.setJdbcUrl(jdbcUrl);
        hikariDataSource.setDriverClassName(driverClassName);
        hikariDataSource.setPoolName("Target-Myql-Server");
        return hikariDataSource;
    }

Below is the bean configuration for the ThreadPoolTaskExecutor:

@Bean(name = "myBatchJobsThreadPollTaskExecutor")
    public ThreadPoolTaskExecutor initializeThreadPoolTaskExecutor() {
        ThreadPoolTaskExecutor threadPoolTaskExecutor = new ThreadPoolTaskExecutor();
        threadPoolTaskExecutor.setCorePoolSize(100);
        threadPoolTaskExecutor.setMaxPoolSize(200);
        threadPoolTaskExecutor.setThreadNamePrefix("My-Batch-Jobs-TaskExecutor ");
        threadPoolTaskExecutor.setWaitForTasksToCompleteOnShutdown(Boolean.TRUE);
        threadPoolTaskExecutor.initialize();
        log.info("Thread Pool Initialized with min {} and Max {} Pool Size",threadPoolTaskExecutor.getCorePoolSize(),threadPoolTaskExecutor.getMaxPoolSize() );
        return threadPoolTaskExecutor;
    }

Here are the main step and the partition step configuration:

@Bean(name = "myMainStep")
    public Step myMainStep() throws Exception{
        return stepBuilderFactory.get("myMainStep").chunk(500)
                .reader(myJdbcReader(null,null))
                .writer(myJpaWriter()).listener(chunkListener)
                .build();
    }

    @Bean
    public Step myPartitionStep() throws Exception {
        return stepBuilderFactory.get("myPartitionStep").listener(myStepListener)
                .partitioner(myMainStep()).partitioner("myPartition",myPartition)
                .gridSize(50).taskExecutor(asyncTaskExecutor).build();
    }
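
The myPartition bean referenced above is not shown in the post. For reference, a range-based Partitioner that feeds parameter1/parameter2 into each step execution context (the keys the reader below looks up) could look roughly like the following hypothetical sketch; the class name and the ID-range handling are assumptions, not the original code:

import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class MyRangePartitioner implements Partitioner {

    private final int minId;
    private final int maxId;

    public MyRangePartitioner(int minId, int maxId) {
        this.minId = minId;
        this.maxId = maxId;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        int rangeSize = (maxId - minId) / gridSize + 1;
        int start = minId;
        for (int i = 0; i < gridSize; i++) {
            ExecutionContext context = new ExecutionContext();
            // keys must match the @Value("#{stepExecutionContext[...]}") lookups in the reader
            context.putInt("parameter1", start);
            context.putInt("parameter2", Math.min(start + rangeSize - 1, maxId));
            partitions.put("partition" + i, context);
            start += rangeSize;
        }
        return partitions;
    }
}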

Updating the post with the reader and writer:

@Bean(name = "myJdbcReader")
    @StepScope
    public JdbcPagingItemReader myJdbcReader(@Value("#{stepExecutionContext[parameter1]}") Integer parameter1, @Value("#{stepExecutionContext[parameter2]}") Integer parameter2) throws Exception{
        JdbcPagingItemReader jdbcPagingItemReader = new JdbcPagingItemReader();
        jdbcPagingItemReader.setDataSource(myTargetDataSource);
        jdbcPagingItemReader.setPageSize(500);
        jdbcPagingItemReader.setRowMapper(myRowMapper());
        Map<String, Object> parameterMap = new HashMap<>();
        parameterMap.put("parameter1", parameter1);
        parameterMap.put("parameter2", parameter2);
        jdbcPagingItemReader.setQueryProvider(myQueryProvider());
        jdbcPagingItemReader.setParameterValues(parameterMap);
        return jdbcPagingItemReader;
    }

    @Bean(name = "myJpaWriter")
    public ItemWriter<MyTargetTable> myJpaWriter(){
        JpaItemWriter<MyTargetTable> targetJpaWriter = new JpaItemWriter<>();
        targetJpaWriter.setEntityManagerFactory(localContainerEntityManagerFactoryBean.getObject());
        return targetJpaWriter;
    }

Can someone shed some light on how to improve the read/write performance of this Spring Batch job?

Could you post the reader/writer as well? - StanislavL
@StanislavL Thanks for the reply. I have updated the post with the reader and writer. - Sonu Agarwal
Have you profiled? What is the bottleneck? - Michael Minella
@MichaelMinella I am sorry, could you please guide/suggest how to achieve profiling and find the bottleneck? As I have implemented the step execution listener, the time difference between job start and end is 30 minutes for 250K records. - Sonu Agarwal
1. Add the necessary config to log the SQL produced to pull the data. 2. Use a SQL client to run the SQL manually and check the query execution plan; look for table scans, and if you see any, maybe create an index. 3. Also confirm the latency between the app server and the database server. - Edwin M. Cruz

1 Answer


Improving the performance of such an application depends on multiple parameters (grid size, chunk size, page size, thread pool size, DB connection pool size, latency between the DB servers and your JVM, etc.). So I can't give you a precise answer to your question, but I will try to provide some guidelines:

  • Before starting to improve performance, you need to clearly define a baseline and a target. Saying "it is slow" makes no sense. Get yourself ready with at least a JVM profiler and a SQL client with a query execution plan analyser. Those are required to find the performance bottleneck, either in your JVM or in your database.
  • Setting the grid size to 50 while using a thread pool with a core size of 100 means 50 threads will be created but never used. Make sure you are actually passing the thread pool task executor in .taskExecutor(asyncTaskExecutor) and not a SimpleAsyncTaskExecutor, which does not reuse threads (see the first sketch after this list).
  • 50 partitions for 250k records seems like a lot to me. You will have 5,000 records per partition, and each partition will yield 10 transactions (since chunkSize = 500). So you will have 10 transactions x 50 partitions = 500 transactions between the two database servers and your JVM. This can be a performance issue. I would recommend starting with fewer partitions, 5 or 10 for example. Increasing concurrency does not necessarily mean increasing performance. There is always a break-even point where your app will spend more time on context switching and concurrency management than on its business logic. Finding that point is an empirical process.
  • I would first run the SQL query outside of any Spring Batch job to see whether there is a performance issue with the query itself (a query fetching too many columns, too many records, etc.) or with the DB schema (a missing index, for example).
  • I would not use JPA/Hibernate for such an ETL job. Mapping data to domain objects can be expensive, especially if the O/R mapping is not optimized. Raw JDBC is usually faster in these cases (see the second sketch after this list).
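
For the task executor and grid size points, the wiring could look roughly like the following sketch. It reuses the bean names from the question (assuming the same @Configuration class); the grid size of 10 is only a starting value to tune against your baseline:

@Bean
    public Step myPartitionStep(@Qualifier("myBatchJobsThreadPoolTaskExecutor") TaskExecutor taskExecutor) throws Exception {
        return stepBuilderFactory.get("myPartitionStep").listener(myStepListener)
                // partition the worker step using the custom Partitioner
                .partitioner("myPartition", myPartition).step(myMainStep())
                // keep gridSize <= core pool size so no partition waits for a thread
                .gridSize(10)
                // the pooled executor reuses threads, unlike a SimpleAsyncTaskExecutor
                .taskExecutor(taskExecutor)
                .build();
    }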
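
For the last point, a JdbcBatchItemWriter can replace the JpaItemWriter and bypass the persistence context entirely. A minimal sketch, where the SQL statement, table and column names are made up and must be adapted to the real schema:

@Bean(name = "myJdbcWriter")
    public JdbcBatchItemWriter<MyTargetTable> myJdbcWriter(@Qualifier("targetMySqlDataSource") DataSource targetDataSource) {
        return new JdbcBatchItemWriterBuilder<MyTargetTable>()
                .dataSource(targetDataSource)
                // hypothetical statement; list the real columns of the target table
                .sql("INSERT INTO my_target_table (id, name) VALUES (:id, :name)")
                // bind the named parameters from MyTargetTable getters
                .beanMapped()
                .build();
    }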

There are a lot of other tricks, like estimating an item's size in memory and making sure the total chunk size in memory stays below the heap size to avoid unnecessary GC within a chunk, or choosing the right GC algorithm for batch apps, but those are somewhat advanced. The list of guidelines above is a good starting point IMO.
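
For example, with an assumed item size of roughly 1 KB, a chunk of 500 items holds about 0.5 MB, so 10 concurrent partitions keep around 5 MB of chunk data live at once; the same math with 10 KB items and 50 partitions already reaches about 250 MB, which starts to matter on a small heap.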

Hope this helps!