Spark-streaming for task parallelization

Question

I am designing a system with the following flow:

Download feed files (line based) over the network
Parse the elements into objects
Filter invalid / unnecessary objects
Execute blocking IO (HTTP Request) on part of the elements
Save to DB

I have been considering implementing the system using Spark-streaming mainly for tasks parallelization, resource management, fault tolerance, etc.

But I am not sure this is the right use-case for spark streaming, as I am not using it only for metrics and data processing. Also I'm not sure how Spark-streaming handles blocking IO tasks.

Is Spark-streaming suitable for this use-case? Or maybe I should look for another technology/framework?

maasg maasg · Accepted Answer · 2016-06-05T18:37:48

Spark is, at its heart, a general parallel computing framework. Spark Streaming adds an abstraction to support stream processing using micro-batching. We can certainly implement such an use case on Spark Streaming.

To 'fan-out' the I/O operations, we need to ensure the right level of parallelism at two levels:

First, distribute the data evenly across partitions: The initial partitioning of the data will depend on the streaming source used. For this usecase, it would look like a custom receiver could be the way to go. After the batch is received, we probably need to use dstream.repartition(n) to a larger number of partitions that should roughly match 2-3x the number of executors allocated for the job.
Spark uses 1 core (configurable) for each task executed. Tasks are executed per partition. This makes the assumption that our task is CPU intensive and requires a full CPU. To optimize execution for blocking I/O, we would like to multiplex that core for many operations. We do this by operating directly on the partitions and using classical concurrent programming to parallelize our work.

Given the original stream of feedLinesDstream, we could so something like: (* in Scala. Java version should be similar, but like x times more LOC)

val feedLinesDstream = ??? // the original dstream of feed lines
val parsedElements = feedLinesDstream.map(parseLine)
val validElements = parsedElements.filter(isValid _)
val distributedElements = validElements.repartition(n) // n = 2 to 3 x #of executors

// multiplex execution at the level of each partition
val data =  distributedElements.mapPartitions{ iter =>
   implicit executionContext = ??? // obtain a thread pool for execution
   val futures = iter.map(elem => Future(ioOperation(elem)))
   // traverse the future resulting in a future collection of results
   val res = Future.sequence(future) 
   Await.result(res, timeout)
}
data.saveToCassandra(keyspace, table)

Spark-streaming for task parallelization

2 Answers