Our current application runs Spring Batch jobs against an RDBMS (Oracle). As part of a strategic roadmap, all data will move to Hive and the dependency on Oracle will be removed. As part of this roadmap, we are doing a POC to verify the feasibility of running Spring Batch against Hive. However, with the Hive JDBC driver configured, deploying the application locally in JBoss fails with the exception "DatabaseType not found for product name: [Apache Hive]". The issue comes from the configuration of JobRepository and JsrJobParametersConverter, as both of them look up the database type from the data source's product name. As far as we can see, the class org.springframework.batch.support.DatabaseType (spring-batch-infrastructure-4.0.0.RELEASE.jar) does not support Hive.
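To make the failure mode concrete, here is a minimal, self-contained sketch of a product-name lookup of the kind that produces this error. It is not Spring Batch's actual source: the class ProductNameLookup and its short list of names are hypothetical, and only the exception message mirrors the one quoted above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified stand-in for a lookup keyed on the JDBC product name.
// The real DatabaseType enum in Spring Batch knows more databases than listed here.
public class ProductNameLookup {

    // A few illustrative product names; "Apache Hive" is deliberately absent.
    private static final Set<String> KNOWN = new HashSet<>(
            Arrays.asList("Oracle", "MySQL", "PostgreSQL", "HSQL Database Engine"));

    // Returns the name if it is known, otherwise fails with the message from the post.
    public static String fromProductName(String productName) {
        if (!KNOWN.contains(productName)) {
            throw new IllegalArgumentException(
                    "DatabaseType not found for product name: [" + productName + "]");
        }
        return productName;
    }

    public static void main(String[] args) {
        System.out.println(fromProductName("Oracle"));
        try {
            // In a real application this string would come from
            // DatabaseMetaData.getDatabaseProductName() on the configured data source.
            fromProductName("Apache Hive");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Because the lookup is driven purely by the reported product name, any driver that reports a name outside the known set fails in the same way, which is consistent with the need for the custom classes described below.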
As we could not find any solution, we followed the (limited) guidelines provided in the Spring Batch documentation, section "4.3.4 Non-standard Database Types in Repository":
https://docs.spring.io/spring-batch/docs/current/reference/html/index-single.html
- Extended JobRepositoryFactoryBean with a customized class, JobRepositoryFactoryBeanForHive.
- Implemented the various DAO interfaces on which SimpleJobRepository depends. This was done to have control over these DAO implementations, as they are responsible for persisting batch metadata in the database.
  - JobInstanceDao (HiveJdbcJobInstanceDao)
  - JobExecutionDao (HiveJdbcJobExecutionDao)
  - StepExecutionDao (HiveJdbcStepExecutionDao)
  - ExecutionContextDao (JdbcHiveExecutionContextDao)
- Hive does not support sequences. As a workaround, we created a table to which an incremented ID is added, and we retrieve the maximum value from that table on every request.
- Implemented HiveIncrementerFactory (a factory that creates HiveIncrementer instances) and the related HiveIncrementer (which retrieves the next value from the table created for the sequence).
- Modified the implementation of the method determineClobTypeToUse() in JobRepositoryFactoryBeanForHive to set the type to VARCHAR. In the database, the field SERIALIZED_CONTEXT has been declared with the datatype VARCHAR, as Hive does not support CLOB. At most 2 GB can be stored. (As a CLOB in Oracle can be 8 GB, should three fields be created, splitting the context across them if it exceeds 2 GB?)

      <bean id="jobRepository" class="com.batch.springutil.JobRepositoryFactoryBeanForHive">
          <property name="dataSource" ref="DataSource" />
          <property name="databaseType" value="oracle" />
          <property name="incrementerFactory" ref="hiveIncrementerFactory" />
          <property name="transactionManager" ref="transactionManager" />
          <property name="isolationLevelForCreate" value="ISOLATION_DEFAULT" />
      </bean>

      <bean id="hiveIncrementerFactory" class="com.batch.springutil.HiveIncrementerFactory">
          <constructor-arg ref="DataSource" />
      </bean>

- Implemented a customized class, JsrJobParametersConverterHive, extending JsrJobParametersConverter.

      <bean id="jobParametersConverter" class="com.batch.springutil.JsrJobParametersConverterHive">
          <constructor-arg ref="BatchDataSource" />
      </bean>
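
The table-based sequence workaround described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual HiveIncrementer: the names TableBackedIncrementer, SequenceTable and InMemorySequenceTable are hypothetical, and an in-memory list stands in for the Hive table so that the example runs without a database. A real implementation would presumably implement Spring's DataFieldMaxValueIncrementer interface and issue the equivalent SELECT MAX and INSERT statements over JDBC.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a table-backed ID generator: each call reads the
// current maximum ID, stores max + 1 as a new row, and returns that value.
public class TableBackedIncrementer {

    // Minimal abstraction over the sequence table so the sketch runs without Hive.
    // A JDBC-backed version would run a SELECT MAX(...) and an INSERT instead.
    public interface SequenceTable {
        long selectMaxId();      // returns 0 when the table is empty
        void insertId(long id);
    }

    // In-memory stand-in, used only for demonstration.
    public static class InMemorySequenceTable implements SequenceTable {
        private final List<Long> rows = new ArrayList<>();

        @Override
        public long selectMaxId() {
            long max = 0L;
            for (long id : rows) {
                max = Math.max(max, id);
            }
            return max;
        }

        @Override
        public void insertId(long id) {
            rows.add(id);
        }
    }

    private final SequenceTable table;

    public TableBackedIncrementer(SequenceTable table) {
        this.table = table;
    }

    // synchronized serializes callers within this JVM only; it does not
    // protect against other processes writing to the same table.
    public synchronized long nextLongValue() {
        long next = table.selectMaxId() + 1;
        table.insertId(next);
        return next;
    }

    public static void main(String[] args) {
        TableBackedIncrementer incrementer =
                new TableBackedIncrementer(new InMemorySequenceTable());
        System.out.println(incrementer.nextLongValue());
        System.out.println(incrementer.nextLongValue());
    }
}
```

Note that the read-then-insert step is not atomic: if two application instances run against the same table, both could read the same maximum and hand out the same ID, so this approach is safest when only a single process generates IDs.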