Spark is, at its heart, a general parallel computing framework. Spark Streaming adds an abstraction to support stream processing using micro-batching.
We can certainly implement such an use case on Spark Streaming.
To 'fan-out' the I/O operations, we need to ensure the right level of parallelism at two levels:
First, distribute the data evenly across partitions:
The initial partitioning of the data will depend on the streaming source used. For this usecase, it would look like a custom receiver could be the way to go. After the batch is received, we probably need to use dstream.repartition(n) to a larger number of partitions that should roughly match 2-3x the number of executors allocated for the job.
Spark uses 1 core (configurable) for each task executed. Tasks are executed per partition. This makes the assumption that our task is CPU intensive and requires a full CPU. To optimize execution for blocking I/O, we would like to multiplex that core for many operations. We do this by operating directly on the partitions and using classical concurrent programming to parallelize our work.
Given the original stream of feedLinesDstream, we could so something like:
(* in Scala. Java version should be similar, but like x times more LOC)
val feedLinesDstream = ??? // the original dstream of feed lines
val parsedElements = feedLinesDstream.map(parseLine)
val validElements = parsedElements.filter(isValid _)
val distributedElements = validElements.repartition(n) // n = 2 to 3 x #of executors
// multiplex execution at the level of each partition
val data = distributedElements.mapPartitions{ iter =>
implicit executionContext = ??? // obtain a thread pool for execution
val futures = iter.map(elem => Future(ioOperation(elem)))
// traverse the future resulting in a future collection of results
val res = Future.sequence(future)
Await.result(res, timeout)
}
data.saveToCassandra(keyspace, table)