
Our use case is (1) consuming data from ActiveMQ, (2) transforming it through a general-purpose, reusable streaming process, and then (3) publishing to Kafka. In our case, step (2) would be a reusable Spark Streaming 'service' that attaches an event_source_id, enriches each record with metadata, and then publishes to Kafka.
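To make step (2) concrete, here is a minimal sketch of the per-record enrichment in plain Python. The field names (`event_source_id`, `event_id`, `ingest_ts`) are illustrative assumptions, not something defined in the question:

```python
import json
import time
import uuid

def enrich(record: dict, event_source_id: str) -> dict:
    """Attach source and ingestion metadata to a single record.

    Hypothetical sketch: field names are assumptions for illustration.
    """
    enriched = dict(record)
    enriched["event_source_id"] = event_source_id      # which upstream source produced this
    enriched["event_id"] = str(uuid.uuid4())           # unique id for downstream dedup
    enriched["ingest_ts"] = int(time.time() * 1000)    # ingestion timestamp (ms)
    return enriched

if __name__ == "__main__":
    # In Spark Streaming this would typically be applied per micro-batch,
    # e.g. stream.map(lambda r: enrich(r, source_id)), before serializing
    # to JSON and publishing to Kafka.
    print(json.dumps(enrich({"payload": "hello"}, "activemq-orders")))
```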

The straightforward approach I see is ActiveMQ -> Flume -> Spark Streaming -> Kafka.

Flume seems like an unnecessary extra hop that adds network traffic. As far as I can tell, a Spark Streaming custom receiver would provide a more general solution for ingestion into Hadoop (step 1), and would allow more flexibility for transforming the data, since transformation is an inherent part of Spark Streaming itself; the downside is a loss of coding ease.
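For context, Spark's custom receiver API is JVM-only (you extend `Receiver` in Scala or Java, implement `onStart`/`onStop`, and call `store()` for each record). Below is a plain-Python analogue of that poll-and-store loop, just to illustrate the pattern; the ActiveMQ client is stubbed out as a plain iterator, and any STOMP/JMS consumer would take its place:

```python
import queue
import threading

class SketchReceiver:
    """Plain-Python analogue of Spark's custom Receiver pattern.

    Illustrative sketch only: the real API is Scala/Java
    (org.apache.spark.streaming.receiver.Receiver).
    """

    def __init__(self, consume_fn):
        self._consume = consume_fn        # yields raw messages (stub for ActiveMQ)
        self._buffer = queue.Queue()      # stands in for Spark's store()
        self._stopped = threading.Event()
        self._thread = None

    def on_start(self):
        # Spark expects onStart() to spawn a non-blocking receive thread.
        self._thread = threading.Thread(target=self._receive, daemon=True)
        self._thread.start()

    def on_stop(self):
        self._stopped.set()

    def _receive(self):
        for msg in self._consume():
            if self._stopped.is_set():
                break
            self._buffer.put(msg)         # real receiver would call self.store(msg)

    def drain(self):
        out = []
        while not self._buffer.empty():
            out.append(self._buffer.get())
        return out
```

The appeal of this pattern for step (1) is that ingestion and transformation live in one process, at the cost of writing and operating the receiver yourself (reconnects, backpressure, reliability) instead of leaning on Flume's built-in source/channel/sink machinery.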

I would love to gain some insight from my more experienced peers, as we are in the beginning stages of transforming a large data architecture. Please share any suggestions, insights, or alternatives you can think of.

Thank you world


1 Answer


In theory, Flume should help you build a more efficient ingestion path into HDFS. With Spark Streaming, depending on how you size your micro-batches, ingestion may not be as efficient; but if your use case needs something closer to real time, then yes, I think you could do it with Spark Streaming directly.

Most applications will want to store the original data in HDFS so they can refer back to it later. Flume would help with that, but if you don't have that need, you may want to skip it. Also, you can always persist your RDDs in Spark at any point.

Also, if you want to consume in real time, you may want to look at Storm.

Your use case is loosely defined, though, so more information on the constraints (volume, latency requirements, how you want to expose this data, etc.) would help you get more concrete answers.

EDIT: Here is a link describing how they went from a 1-hour Flume + Hadoop batch cycle to 5-second cycles, still using Flume to help with ingestion scalability. So whether to use Flume there comes down to your use case... I'd say it makes sense to separate the ingestion layer if you want that data to be, e.g., consolidated in a lambda-style architecture.