
I have a social platform where users can create new posts. Right now, that post data is written straight to the database. I want to implement a streaming pipeline to process these new posts using Spark Streaming.

Query 1: How do I send these new posts from the database into the Spark Streaming architecture? Should I use Kafka as the middleman here (it seems more scalable for the future), or just stream the data from the database to Spark Streaming over a socket (and if so, how)?

The databases used are Firebase and MongoDB (it would be great if the procedure were explained for both).

Query 2: I started learning about Kafka, and it's mentioned that it can process streams of posts. So why not use Kafka to process the streams instead of Spark Streaming? Why do people mostly use Kafka just as a message broker and not for stream processing?

Thanks in advance.


1 Answer


Query 1: You can use either architecture to start; it depends entirely on how much time you have and how much you can experiment. Having Kafka as a middleman is always a good idea when there is a lot of data flowing through. MongoDB can also be used with Spark directly via a connector, as in this article: https://databricks.com/blog/2015/03/20/using-mongodb-with-spark.html
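If you go the Kafka route, the consuming side could look something like the sketch below. This uses Spark Structured Streaming (the newer API that has largely replaced DStream-based Spark Streaming), and it assumes your app publishes each new post as JSON to a Kafka topic whenever it saves one; the topic name `new-posts`, the broker address, and the console sink are placeholders for your own setup:

```scala
import org.apache.spark.sql.SparkSession

// Needs the spark-sql-kafka-0-10 package on the classpath.
object PostStreamJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("post-stream")
      .getOrCreate()

    // Subscribe to the topic your app writes new posts to.
    val posts = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "new-posts")
      .load()
      .selectExpr("CAST(value AS STRING) AS post_json")

    // Placeholder sink: dump each micro-batch to the console.
    // Swap in your real processing and output here.
    posts.writeStream
      .format("console")
      .outputMode("append")
      .start()
      .awaitTermination()
  }
}
```

For the direct MongoDB route from the linked article, the connector exposes the collection as a DataFrame. Option names differ between connector versions (these are the 10.x names), and the database/collection names are hypothetical:

```scala
// Reuses the SparkSession from above; needs the MongoDB Spark Connector.
val postsDf = spark.read
  .format("mongodb")
  .option("connection.uri", "mongodb://localhost:27017")
  .option("database", "social")     // hypothetical database name
  .option("collection", "posts")    // hypothetical collection name
  .load()
```

Firebase has no comparable official Spark connector, so the usual pattern there is a Cloud Function that triggers on each new post and forwards it to Kafka, which then feeds a job like the first sketch.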

Query 2: Spark was built entirely for processing data, while Kafka was built as a messaging system and later evolved to cover other use cases (the Kafka Streams API). So there is no single answer for why not to use Kafka directly as a data processor. It's like asking why we shouldn't use Elasticsearch as the source of truth instead of a database: it can take some time for a technology to stabilize in a new role. This comparison covers the trade-offs: https://dzone.com/articles/spark-streaming-vs-kafka-stream-1
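To make that concrete, here is a minimal Kafka Streams sketch (the Scala DSL, to match the Spark example above) that counts posts per user entirely inside Kafka, with no separate processing cluster. The topic names are hypothetical, and it assumes each record is keyed by user ID:

```scala
import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.serialization.Serdes._ // Kafka 2.4+ path

object PostCounter extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "post-counter")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new StreamsBuilder()

  // Running count of posts per user, written back to another topic.
  builder.stream[String, String]("new-posts")
    .groupByKey
    .count()
    .toStream
    .to("post-counts-by-user")

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()
  sys.addShutdownHook(streams.close())
}
```

The difference in the comparison: Kafka Streams is a library embedded in your own application, well suited to simpler Kafka-to-Kafka transformations, while Spark gives you a full cluster runtime with batch and stream processing unified, which matches the analysis in the DZone link.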

Hope this helps!