0 votes

We are building a Kafka Connect application using the JDBC source connector in timestamp+incrementing mode. We tried standalone mode and it works as expected. Now we would like to switch to distributed mode.
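For reference, a minimal standalone connector config in this mode might look like the following sketch. The connection URL, table, and column names are illustrative assumptions, not taken from the question:

```properties
# Illustrative JDBC source config (names and URL are assumptions)
name=hive-jdbc-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:hive2://hive-host:10000/default
mode=timestamp+incrementing
incrementing.column.name=id
timestamp.column.name=updated_at
table.whitelist=my_table
topic.prefix=hive-
tasks.max=1
```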

When we have a single Hive table as the source, how will the tasks be distributed among the workers?

The problem we faced is that when we run the application with multiple instances, every instance queries the table and fetches the same rows again. Will parallelism work in this case? If so, how will the tasks coordinate with each other on the current status of the table?

If possible, you may also add a code snippet of what you are trying. - Swapnil
For different JDBC sources you need to write the code differently to achieve parallelism. You have to use row_number, row_count, or some other built-in database function; that way you can achieve parallelism. Have a look at the spark-jdbc connector and you will get the idea. - mahendra singh
I believe only one task is assigned to each table anyway. Also, you may want to put your source data into Kafka first, and then sink it into Hive and elsewhere... Hive queries aren't exactly quick. - OneCricketeer

1 Answer

0 votes

The tasks.max parameter doesn't make any practical difference for the kafka-connect-jdbc source connector in this case: the property is handled by the Connect framework itself (there is no occurrence of it in the source code of the JDBC connector project), and the connector creates at most one task per table, so a single source table is always read by a single task regardless of how many workers you run.

Consult the JDBC source configuration options for the available properties of this connector.
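For completeness, in distributed mode a connector is not started from a properties file but submitted as JSON to a worker's REST API. A minimal sketch, where the connector name, connection URL, and column names are assumptions for illustration:

```json
{
  "name": "hive-jdbc-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:hive2://hive-host:10000/default",
    "mode": "timestamp+incrementing",
    "incrementing.column.name": "id",
    "timestamp.column.name": "updated_at",
    "table.whitelist": "my_table",
    "topic.prefix": "hive-",
    "tasks.max": "1"
  }
}
```

This payload would typically be POSTed to the worker's `/connectors` endpoint; the cluster then schedules the connector's tasks on its workers and stores source offsets in Kafka, which is how progress survives restarts.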