MS SQL CDC with Kafka Connect and Apache Kafka

Question

In my current use case, I am using Spark core to read data from MS SQL Server and doing some processing on the data and sending it to Kafka every 1 minute, I am using Spark and Phoenix to maintain the CDC information in HBase table.

But this design has some issues, for e.g. if there is surge in MS SQL records Spark processing takes more time than batch interval and spark ends up sending duplicate records to Kafka.

As an alternate to this I am thinking of using Kafka Connect to read the messages from MS SQL and send records to Kafka topic and maintain the MS SQL CDC in Kafka. Spark Streaming will read records from Kafka topic and process the records and stores into HBase and send to other Kafka topics.

I have a few questions in order to implement this architecture:

Can I achieve this architecture with open source Kafka connectors and Apache Kafka 0.9 versions.
If yes can you please recommend me a GitHub project, which can offer me such connectors where I can CDC MS SQL tables using SQL query such as SELECT * FROM SOMETHING WHERE COLUMN > ${lastExtractUnixTime}) and store records into Kafka topic.
Does Kafka connect supports Kerberos Kafka setup.

Have you tried github.com/jcustenborder/kafka-connect-cdc-mssql ? [1]: — gorros

Robin Moffatt Robin Moffatt · Accepted Answer · 2017-09-12T06:33:10

Can I achieve this architecture with open source Kafka connectors and Apache Kafka 0.9 versions.

Yes, Kafka Connect was released in version 0.9 of Apache Kafka. Features such as Single Message Transforms were not added until later versions though. If possible, you should be using the latest version of Apache Kafka (0.11)

If yes can you please recommend me a GitHub project, which can offer me such connectors where I can CDC MS SQL tables using SQL query such as SELECT * FROM SOMETHING WHERE COLUMN > ${lastExtractUnixTime}) and store records into Kafka topic.

You can use the JDBC Source which is available as part of Confluent Platform (or separately), and may also want to investigate kafka-connect-cdc-mssql

Does Kafka connect supports Kerberos Kafka setup.

Yes -- see here and here

Regarding this point :

Spark Streaming will read records from Kafka topic and process the records and stores into HBase and send to other Kafka topics.

You can actually use Kafka Connect here too -- there are Sinks available for HBase -- see the full list of connectors here. For further manipulation of data in Kafka, there is the Kafka Streams API, and KSQL.

MS SQL CDC with Kafka Connect and Apache Kafka

1 Answers