I wrote a simple Spark Streaming application which reads a stream of events from Kafka and stores them in Cassandra, in a table that allows efficient queries over the data. The main purpose of this job is to process current, real-time data, but there are also historical events stored in HDFS.
I want to reuse the RDD-processing code (a part of the streaming job) in a historical job, and I am wondering what the best solution is for reading the historical data, given the following requirements:
1. Historical events are stored in daily rolled-up files in HDFS (I want to run the job over a range of these files).
2. It would be nice to be able to pause the job (inserts into Cassandra are idempotent, so at-least-once processing is enough).
3. I want some throttling mechanism that lets me define the maximum number of events that can be processed in a period of time (e.g. every 1 min).
I've considered two approaches so far:
- Batch Spark job
  - Regarding requirement 1: Is there a better way to define an RDD over a range of files than creating one RDD per file and then unioning them? (See the first sketch after this list.)
  - Regarding requirements 2 and 3: Is this possible at all?
- Spark Streaming job
  - Regarding requirement 1: How can I efficiently define a range of input files? Is there anything better than using `ssc.textFileStream(inputDir)` and copying the files I want to process into that directory? (See the second sketch after this list.)
  - Regarding requirement 2: I assume that setting a checkpoint directory is what I am looking for.
  - Regarding requirement 3: I plan to use the `spark.streaming.receiver.maxRate` property.
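For reference, here is a minimal sketch of the batch variant I have in mind; the HDFS layout, the date range, and the `processEvents` helper are hypothetical placeholders, not my actual code:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object HistoricalBatchJob {
  // Hypothetical stand-in for the RDD-processing code shared with the streaming job.
  def processEvents(events: RDD[String]): Unit = {
    // parse events and write them to Cassandra (idempotent inserts)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("historical-events"))

    // Assumed layout: one directory of rolled-up files per day.
    val days  = Seq("2015-09-01", "2015-09-02", "2015-09-03")
    val paths = days.map(d => s"hdfs:///events/$d/*")

    // Option A: one RDD per day, then a single union.
    val unioned = sc.union(paths.map(p => sc.textFile(p)))

    // Option B: textFile accepts a comma-separated list of paths/globs,
    // so the whole range can be read as one RDD.
    val allAtOnce = sc.textFile(paths.mkString(","))

    processEvents(allAtOnce)
    sc.stop()
  }
}
```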
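And a rough sketch of the streaming variant, assuming the files to process are staged into a monitored directory; the directories, batch interval, and rate value are made-up examples:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object HistoricalStreamingJob {
  // Hypothetical stand-in for the RDD-processing code shared with the real-time job.
  def processEvents(events: RDD[String]): Unit = ()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("historical-events-streaming")
      // Planned throttle: max records per second for each receiver-based stream.
      .set("spark.streaming.receiver.maxRate", "10000")

    val ssc = new StreamingContext(conf, Seconds(60))        // e.g. 1 min batches
    ssc.checkpoint("hdfs:///checkpoints/historical-events")  // assumed checkpoint dir

    // Files copied into this directory are picked up as new batches.
    val events = ssc.textFileStream("hdfs:///staging/historical-events")
    events.foreachRDD(rdd => processEvents(rdd))

    ssc.start()
    ssc.awaitTermination()
  }
}
```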
Am I right that a regular batch Spark job cannot meet my requirements? I would appreciate your advice on the Spark Streaming solution.