Can the Kafka-Spark Streaming pair be used for both batch and real-time data?

Question

Hi all,

I am currently designing an architecture that should handle both real-time and batch data (coming from disparate sources and point solutions, i.e. third-party tools). The existing architecture is old school and relies mostly on RDBMS (I am not going to go into detail on that).

What I have come up with is two different pipelines: one for batch data (Sqoop/Spark/Hive) and the other for real-time data (Kafka with Spark Streaming).

But I have been told to use the Kafka-Spark Streaming pair to handle all kinds of data.

If anyone has experience using the Kafka-Spark Streaming pair to handle all kinds of data, could you please briefly explain whether this is a viable solution, and whether it is better than having two different pipelines?

Thanks in advance!

dossani dossani · Accepted Answer · 2019-09-30T17:33:56

Lambda architecture would be the way to go! Hope this link gives you enough ideas: https://dzone.com/articles/lambda-architecture-how-to-build-a-big-data-pipeli
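For a rough idea of what Lambda architecture means in practice: a batch layer periodically recomputes a view over all historical data, a speed layer keeps a view of the events that arrived since the last batch run, and a serving layer merges the two at query time. Below is a minimal pure-Python sketch of that idea. The event data and the function names (`build_view`, `query`) are made up for illustration; in a real setup the batch side might be something like Sqoop/Spark/Hive and the speed side Kafka with Spark Streaming.

```python
from collections import Counter

# Hypothetical events as (source, count) pairs; the names are illustrative only.
historical_events = [("crm", 3), ("erp", 5), ("crm", 2)]  # already landed in batch storage
recent_events = [("crm", 1), ("web", 4)]                  # arrived after the last batch run


def build_view(events):
    """Aggregate events into a per-source total."""
    view = Counter()
    for source, count in events:
        view[source] += count
    return view


def query(batch_view, speed_view, source):
    """Serving layer: combine the batch view with the real-time view."""
    return batch_view.get(source, 0) + speed_view.get(source, 0)


batch_view = build_view(historical_events)  # batch layer: recomputed periodically
speed_view = build_view(recent_events)      # speed layer: updated as events stream in

print(query(batch_view, speed_view, "crm"))  # 6
```

The point of the split is that the batch view can be rebuilt from scratch at any time, while the speed view only has to cover the gap since the last batch run.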

Thanks much.
