2
votes

I've integrated Kafka and Spark Streaming after downloading them from the Apache website. However, I want to use DataStax for my big data solution, and I saw that you can easily integrate Cassandra and Spark.

But I can't see any Kafka modules in the latest version of DataStax Enterprise. How do I integrate Kafka with Spark Streaming here?

What I want to do is basically:

  • Start the necessary brokers and servers
  • Start a Kafka producer (a sketch follows this list)
  • Start a Kafka consumer
  • Connect Spark Streaming to the Kafka broker and receive the messages from there
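
For step 2, something like this minimal Python producer sketch is what I have in mind. It assumes the third-party kafka-python client; the topic name and broker address are placeholders for whatever the actual setup uses:

    # Minimal producer sketch using the third-party kafka-python client
    # (pip install kafka-python). "test-topic" and the broker address
    # are placeholders; adjust them to your setup.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    # Publish a few plain-byte messages to the topic.
    for i in range(10):
        producer.send("test-topic", ("message %d" % i).encode("utf-8"))

    # Block until every buffered message has actually been sent.
    producer.flush()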

However, after a quick Google search, I can't find anywhere that Kafka has been incorporated into DataStax Enterprise.

How can I achieve this? I'm really new to DataStax and Kafka, so I need some advice. Language preference: Python. Thanks!

1
Are you trying to read from Kafka using Spark Streaming? Why would you care whether it is part of DataStax Enterprise or not?! - mamdouh alramadan
I'm trying to feed messages to Kafka and read them from Spark: kafka -> spark. And I care because then I wouldn't have to worry about external configuration, Kafka setup, and connection dependencies, which is the main thing DataStax is known for. - HackCode
That's not true at all; DataStax adopted Cassandra and provides DA solutions around it. Regardless, if you don't want to manage Kafka brokers yourself, you can use Cloudera's solution for that (not recommended, as the cons outweigh the pros in this specific case). Your question is about integration (code-wise). The question is confusing, and I believe you need to be more specific in order to get more helpful answers. - mamdouh alramadan
My question is simply this: there are no Apache Kafka modules in DSE. Do we necessarily need to start Kafka brokers and producers independently and connect them to the DSE version of Spark, or is there an easier way through DSE? - HackCode
DSE does not provide a Kafka setup (AFAIK). Therefore, you need to set up the Kafka brokers yourself or, as I mentioned earlier, through another third-party provider such as Cloudera. Once you have your brokers set up, you can run a producer from the bin directory (Kafka ships with a lightweight console producer you can use for testing) and simply connect your Spark Streaming job to the brokers you have. I don't know if this answers your question, but let me know if I can be of any further assistance. - mamdouh alramadan

1 Answer

1
votes

Good question. DSE does not incorporate Kafka out of the box; you must set up Kafka yourself and then set up your Spark Streaming job to read from Kafka. Since DSE does bundle Spark, use DSE Spark to run your Spark Streaming job.

You can use either the direct Kafka API or Kafka receivers; there are more details here on the trade-offs. TL;DR: the direct API does not require a WAL or ZooKeeper for HA.
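
For example, a minimal Python Spark Streaming job using the direct API might look like the sketch below. This assumes Spark 1.x's pyspark.streaming.kafka module; the app name, topic name, and broker address are placeholders:

    # Minimal direct-API sketch using Spark 1.x's pyspark.streaming.kafka
    # module. App name, topic name, and broker address are placeholders.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="kafka-direct-example")
    ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

    # Connect straight to the brokers; no receiver, WAL, or ZooKeeper
    # is needed for the streaming job itself.
    stream = KafkaUtils.createDirectStream(
        ssc,
        topics=["test-topic"],
        kafkaParams={"metadata.broker.list": "localhost:9092"},
    )

    # Each record arrives as a (key, value) pair; print the values per batch.
    stream.map(lambda kv: kv[1]).pprint()

    ssc.start()
    ssc.awaitTermination()

You would submit a script like this with dse spark-submit so it runs on the Spark bundled with DSE, making sure the spark-streaming-kafka package matching your Spark version is on the classpath.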

Here is an example by Cary Bourgeois of how you can configure Kafka to work with DSE:

https://github.com/CaryBourgeois/DSE-Spark-Streaming/tree/master