Our current application runs Spring Batch jobs against an RDBMS (Oracle). As part of our strategic roadmap, all data will move to Hive and the dependency on Oracle will be removed. We are therefore doing a POC to verify the feasibility of running Spring Batch against Hive. However, with the Hive JDBC driver configured, deploying the application locally on JBoss fails with the exception "DatabaseType not found for product name: [Apache Hive]". The exception comes from the configuration of JobRepository and JsrJobParametersConverter, as both determine the database type from the data source's product name. The class org.springframework.batch.support.DatabaseType (spring-batch-infrastructure-4.0.0.RELEASE.jar) does not support Hive.
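
To make the failure mode concrete, here is a minimal sketch of a product-name lookup of this kind. It is a simplified stand-in, not Spring Batch's actual source: the enum name SketchDatabaseType and its three entries are hypothetical, and only the exception message is taken from the error above.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for a product-name lookup (hypothetical, not Spring Batch code).
enum SketchDatabaseType {
    ORACLE("Oracle"),
    MYSQL("MySQL"),
    POSTGRES("PostgreSQL");

    // Maps the product name reported by the JDBC driver to a known type.
    private static final Map<String, SketchDatabaseType> BY_PRODUCT_NAME = new HashMap<>();

    static {
        for (SketchDatabaseType type : values()) {
            BY_PRODUCT_NAME.put(type.productName, type);
        }
    }

    private final String productName;

    SketchDatabaseType(String productName) {
        this.productName = productName;
    }

    // Throws when the driver reports a product name that has no entry,
    // which is what happens for "Apache Hive".
    static SketchDatabaseType fromProductName(String productName) {
        SketchDatabaseType type = BY_PRODUCT_NAME.get(productName);
        if (type == null) {
            throw new IllegalArgumentException(
                "DatabaseType not found for product name: [" + productName + "]");
        }
        return type;
    }
}
```

Any component that resolves the database type this way fails as soon as the driver reports an unlisted product name, regardless of how the rest of the repository is configured.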

As we could not find a solution, we followed the (limited) guidance in the section "4.3.4 Non-standard Database Types in Repository" of the Spring Batch documentation:
https://docs.spring.io/spring-batch/docs/current/reference/html/index-single.html

  • Extended JobRepositoryFactoryBean with a custom class, JobRepositoryFactoryBeanForHive
  • Implemented the DAO interfaces that SimpleJobRepository depends on, listed below. This gives us control over these DAO implementations, as they are responsible for persisting batch metadata in the database.

    1. JobInstanceDao (HiveJdbcJobInstanceDao)
    2. JobExecutionDao (HiveJdbcJobExecutionDao)
    3. StepExecutionDao (HiveJdbcStepExecutionDao)
    4. ExecutionContextDao (JdbcHiveExecutionContextDao)
  • Hive does not support sequences. As a workaround, we created a table to which each new ID is added, and we retrieve the max value from that table on every request

  • Implemented HiveIncrementerFactory (a factory that creates HiveIncrementer instances) and the related HiveIncrementer (which retrieves the next value from the table created for the sequence)

  • Overrode the method determineClobTypeToUse() in JobRepositoryFactoryBeanForHive to set the type to Types.VARCHAR. In the database, the field SERIALIZED_CONTEXT is declared as VARCHAR because Hive does not support CLOB. At most 2 GB can be stored. (As a CLOB in Oracle can be 8 GB, should 3 fields be created to store the context, splitting it across them when it exceeds 2 GB?)

    <bean id="jobRepository" class="com.batch.springutil.JobRepositoryFactoryBeanForHive">
        <property name="dataSource" ref="DataSource" />
        <property name="databaseType" value="oracle" />
        <property name="incrementerFactory" ref="hiveIncrementerFactory" />
        <property name="transactionManager" ref="transactionManager" />
        <property name="isolationLevelForCreate" value="ISOLATION_DEFAULT" />
    </bean>
    <bean id="hiveIncrementerFactory" class="com.batch.springutil.HiveIncrementerFactory">
        <constructor-arg ref="DataSource" />
    </bean>
    
  • Implemented a custom class JsrJobParametersConverterHive that extends JsrJobParametersConverter

    <bean id="jobParametersConverter" class="com.batch.springutil.JsrJobParametersConverterHive">
          <constructor-arg ref="BatchDataSource" />
    </bean>
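
The table-based sequence workaround described above can be sketched roughly as follows. This is not our actual HiveIncrementer: the names SequenceTable, InMemorySequenceTable and TableBackedIncrementer are hypothetical, and the in-memory table only stands in for the Hive table so the sketch runs without a connection. The three next*Value methods mirror those of Spring's DataFieldMaxValueIncrementer without depending on Spring.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical storage abstraction: in a real setup these two calls would be
// JDBC statements against the sequence table in Hive.
interface SequenceTable {
    long readMax(String sequenceName);          // highest ID stored so far, 0 if none
    void append(String sequenceName, long id);  // record a newly issued ID
}

// In-memory stand-in so the sketch runs without a Hive connection.
class InMemorySequenceTable implements SequenceTable {
    private final Map<String, Long> maxBySequence = new HashMap<>();

    @Override
    public long readMax(String sequenceName) {
        return maxBySequence.getOrDefault(sequenceName, 0L);
    }

    @Override
    public void append(String sequenceName, long id) {
        maxBySequence.put(sequenceName, Math.max(id, readMax(sequenceName)));
    }
}

// Issues IDs as "current max + 1" for one named sequence.
class TableBackedIncrementer {
    private final SequenceTable table;
    private final String sequenceName;

    TableBackedIncrementer(SequenceTable table, String sequenceName) {
        this.table = table;
        this.sequenceName = sequenceName;
    }

    // synchronized only serializes callers inside this JVM; several JVMs
    // sharing the same table could still read the same max value.
    synchronized long nextLongValue() {
        long next = table.readMax(sequenceName) + 1;
        table.append(sequenceName, next);
        return next;
    }

    int nextIntValue() {
        return Math.toIntExact(nextLongValue());
    }

    String nextStringValue() {
        return Long.toString(nextLongValue());
    }
}
```

With one incrementer per sequence name (for example BATCH_JOB_SEQ), the first call returns 1 and each later call returns the previous value plus one. Because reading the max and writing the new ID are two separate steps, this approach is only safe when a single process issues IDs, which is an assumption worth checking before running jobs from more than one node.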
    

1 Answer

According to section 4.3.4 Non-standard Database Types in Repository from the docs:

If even that doesn’t work, or you are not using an RDBMS, then the only option may be to implement the various Dao interfaces that the SimpleJobRepository depends on and wire one up manually in the normal Spring way.

Since you have implemented the four DAOs that the job repository depends on, you can create a bean of type SimpleJobRepository and wire your DAOs into it. In other words, do not use the JobRepositoryFactoryBean; create the bean yourself:

    @Bean
    public SimpleJobRepository hiveJobRepository(
            HiveJdbcJobInstanceDao hiveJdbcJobInstanceDao,
            HiveJdbcJobExecutionDao hiveJdbcJobExecutionDao,
            HiveJdbcStepExecutionDao hiveJdbcStepExecutionDao,
            JdbcHiveExecutionContextDao jdbcHiveExecutionContextDao) {

        return new SimpleJobRepository(hiveJdbcJobInstanceDao, hiveJdbcJobExecutionDao,
                hiveJdbcStepExecutionDao, jdbcHiveExecutionContextDao);
    }