2
votes

With Spark streaming, I can read Kafka messages and write data to different kind of tables, for example HBase, Hive and Kudu. But this can also be done by using Kafka connectors for these tables. My question is, in which situations I should prefer connectors over the Spark streaming solution.

Also how tolerant is the Kafka connector solution? We know that with Spark streaming, we can use checkpoints and executors running on multiple nodes for fault tolerant execution, but how is fault tolerance (if possibe) achieved with Kafka connectors? By running the connector on multiple nodes?

2

2 Answers

2
votes

in which situations I should prefer connectors over the Spark streaming solution.

"It Depends" :-)

  1. Kafka Connect is part of Apache Kafka, and so has tighter integration with Apache Kafka in terms of security, delivery semantics, etc.
  2. If you don't want to write any code, Kafka Connect is easier because it's just JSON to configure and run
  3. If you're not using Spark already, Kafka Connect is arguably more straightforward to deploy (run the JVM, pass in the configuration)
  4. As a framework, Kafka Connect is more transferable since the concepts are the same, you just plugin the appropriate connector for the technology that you want to integrate with each time
  5. Kafka Connect handles all the tricky stuff for you like schemas, offsets, restarts, scaleout, etc etc etc
  6. Kafka Connect supports Single Message Transform for making changes to data as it passes through the pipeline (masking fields, dropping fields, changing data types, etc etc). For more advanced processing you would use something like Kafka Streams or ksqlDB.
  7. If you are using Spark, and it's working just fine, then it's not necessarily prudent to rip it up to use Kafka Connect instead :)

Also how tolerant is the Kafka connector solution? … how is fault tolerance (if possibe) achieved with Kafka connectors?

  1. Kafka Connect can be run in distributed mode, in which you have one or more worker processes across nodes. If a worker fails, Kafka Connect rebalances the tasks across the remaining ones. If you add a worker in, Kafka Connect will rebalance to ensure workload distribution. This was drastically improved in Apache Kafka 2.3 (KIP-415)
  2. Kafka Connect uses the Kafka consumer API and tracks offsets of records delivered to a target system in Kafka itself. If the task or worker fails you can be sure that it will restart from the correct point. Many connectors support exactly-once delivery too (e.g. HDFS, Elasticsearch, etc)

If you want to learn more about Kafka Connect see the docs here and my talk here. See a list of connectors here, and tutorial videos here.


Disclaimer: I work for Confluent and a big fan of Kafka Connect :-)

5
votes

So, generally, there should be no big difference in functionality when it comes to simply reading records from Kafka and sending them into other services.

Kafka Connect is probably easier when it comes to standard tasks since it offers various connectors out-of-the-box, so it will quite probably reduce the need of writing any code. So, if you just want to copy a bunch of records from Kafka to HDFS or Hive then it will probably be easier and faster to do with Kafka connect.

Having this in mind, Spark Streaming drastically takes over when You need to do things that are not standard i.e. if You want to perform some aggregations or calculations over records and write them to Hive, then You probably should go for Spark Streaming from the beginning.

Genrally, I found doing some substandard things with Kafka connect, like for example splitting one message to multiple ones(assuming it was for example JSON array) to be quite troublesome and often require much more work than it would be in Spark.

As for the Kafka Connect fault tolerance, as it's described in the docs this is achieved by running multiple distributed workers with same group.id, the workers redistribute tasks and connectors if one of them fails.