Our use case is (1) consuming data from ActiveMQ, (2) performing transformations through a general-purpose, reusable streaming process, and then (3) publishing to Kafka. In our case, step (2) would be a reusable Spark Streaming 'service' that would assign an event_source_id, enrich each record with metadata, and then publish to Kafka.
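To make step (2) concrete, here is a minimal sketch of what the per-record enrichment could look like. It is plain Python; apart from event_source_id (mentioned above), every name and field here is an assumption for illustration. Inside Spark Streaming the same function would be applied per record (e.g. in a map over the stream) before publishing to Kafka:

```python
import time
import uuid

# Assumed identifier for this source; in the real service this would
# likely be configured per ActiveMQ queue/topic.
EVENT_SOURCE_ID = "activemq.example_queue"

def enrich_record(record: dict) -> dict:
    """Return a copy of the record tagged with pipeline metadata."""
    enriched = dict(record)                            # don't mutate input
    enriched["event_source_id"] = EVENT_SOURCE_ID
    enriched["event_id"] = str(uuid.uuid4())           # unique per record
    enriched["ingest_ts_ms"] = int(time.time() * 1000) # ingest time, epoch ms
    return enriched

if __name__ == "__main__":
    sample = {"order_id": 42, "amount": 19.99}
    print(enrich_record(sample))
```

The point of keeping the enrichment a pure function is that it stays reusable regardless of whether records arrive via Flume or via a custom receiver.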
The straightforward approach I see is ActiveMQ -> Flume -> Spark Streaming -> Kafka.
Flume seems like an unnecessary extra hop and extra network traffic. As far as I can tell, a Spark Streaming custom receiver would provide a more general solution for ingestion into Hadoop (step 1), and would allow more flexibility for transforming the data, since transformation is an inherent part of Spark Streaming itself; the downside is a loss of coding ease.
I would love some insight from my more experienced peers, as we are in the early stages of transforming a large data architecture; please share any suggestions, insights, or alternatives you can think of.
Thank you world