2 votes

I'm using Spark extensively. The core of Spark is the RDD, and as shown in the RDD paper there are limitations when it comes to streaming applications. Here is an exact quote from the RDD paper:

As discussed in the Introduction, RDDs are best suited for batch applications that apply the same operation to all elements of a dataset. In these cases, RDDs can efficiently remember each transformation as one step in a lineage graph and can recover lost partitions without having to log large amounts of data. RDDs would be less suitable for applications that make asynchronous fine-grained updates to shared state, such as a storage system for a web application or an incremental web crawler.

I don't quite understand why RDDs can't effectively manage state. How does Spark Streaming overcome these limitations?

2 You may find an indirect answer to this query (why Flink is preferred over Spark for streaming) at stackoverflow.com/questions/28082581/…. Have a look at the articles and presentations quoted in that question. - Ravindra babu

2 Answers

5 votes

I don't quite understand why RDDs can't effectively manage state.

It is not really a question of being able or not, but more one of cost. We have well-established mechanisms for handling fine-grained changes, such as write-ahead logging, but managing the logs is simply expensive. They have to be written to persistent storage and periodically merged, and they require expensive replaying in case of failure.

Compared to that, an RDD is an extremely lightweight solution. It is just a small local data structure that has to remember only its lineage (its ancestors and the transformations applied to them).
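To make the lineage idea concrete, here is a minimal sketch (assuming a local Spark setup; all names are illustrative) in which each transformation records one step in the lineage graph, which toDebugString can print:

    import org.apache.spark.{SparkConf, SparkContext}

    object LineageSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("lineage-sketch").setMaster("local[*]"))

        // Each transformation adds one step to the lineage graph;
        // no data is copied or logged.
        val numbers = sc.parallelize(1 to 1000000)
        val squares = numbers.map(n => n.toLong * n)
        val evens   = squares.filter(_ % 2 == 0)

        // Prints the lineage Spark would use to recompute a lost
        // partition from its ancestors, rather than restoring it from a log.
        println(evens.toDebugString)

        sc.stop()
      }
    }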

That does not mean it is impossible to build an at least partially stateful system on top of Spark. Take a look at the Caffe-on-Spark architecture.

How does Spark Streaming overcome these limitations?

It doesn't, or, to be more precise, it handles this problem externally, independently of the RDD abstraction. That includes using input and output operations with source-specific guarantees, and fault-tolerant storage for handling received data.
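As a rough sketch of what that looks like in practice (the checkpoint path, host, and port below are hypothetical), Spark Streaming can enable a write-ahead log for received data and checkpoint per-key state, both of which live outside the RDDs themselves:

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingStateSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("streaming-state-sketch")
          .setMaster("local[2]")
          // Durably log received data before processing it; fault tolerance
          // comes from storage, not from the RDD abstraction.
          .set("spark.streaming.receiver.writeAheadLog.enable", "true")

        val ssc = new StreamingContext(conf, Seconds(10))
        // Checkpointing (hypothetical path) persists metadata and state.
        ssc.checkpoint("/tmp/spark-checkpoint")

        val lines = ssc.socketTextStream(
          "localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)

        // updateStateByKey carries per-key counts across batches; the state
        // survives failures via the checkpoint, again external to the RDDs.
        val counts = lines
          .flatMap(_.split(" "))
          .map(word => (word, 1))
          .updateStateByKey((batch: Seq[Int], state: Option[Int]) =>
            Some(batch.sum + state.getOrElse(0)))

        counts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }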

1 vote

It's explained elsewhere in the paper:

Existing abstractions for in-memory storage on clusters, such as distributed shared memory [24], key-value stores [25], databases, and Piccolo [27], offer an interface based on fine-grained updates to mutable state (e.g., cells in a table). With this interface, the only ways to provide fault tolerance are to replicate the data across machines or to log updates across machines. Both approaches are expensive for data-intensive workloads, as they require copying large amounts of data over the cluster network, whose bandwidth is far lower than that of RAM, and they incur substantial storage overhead.

In contrast to these systems, RDDs provide an interface based on coarse-grained transformations (e.g., map, filter and join) that apply the same operation to many data items. This allows them to efficiently provide fault tolerance by logging the transformations used to build a dataset (its lineage) rather than the actual data. If a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to recompute just that partition. Thus, lost data can be recovered, often quite quickly, without requiring costly replication.

As I interpret that, handling streaming applications would require the system to do lots of writes to individual cells, shove data across the network, perform I/O, and do other costly things. RDDs are meant to avoid all of that by primarily supporting functional-style operations that can be composed.
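For instance, here is a sketch of what a coarse-grained "update" might look like (assuming a live SparkContext named sc; the data is made up): instead of mutating individual cells, you derive a whole new dataset by applying the same join-and-map to every key:

    // State and a batch of updates, both as RDDs of (key, value) pairs.
    val state  = sc.parallelize(Seq(("a", 1), ("b", 2)))
    val deltas = sc.parallelize(Seq(("a", 10)))

    // One coarse-grained transformation over all keys, recoverable from
    // lineage, instead of many fine-grained writes to individual cells.
    val newState = state.leftOuterJoin(deltas).map {
      case (key, (old, delta)) => (key, old + delta.getOrElse(0))
    }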

This is consistent with my recollection from about 9 months ago, when I did a Spark-based MOOC on edX (sadly, I haven't touched it since): as I remember, Spark doesn't even bother to compute the results of maps on RDDs until the user actually asks for some output, and that way it saves a ton of computation.
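A quick sketch of that laziness (again assuming a SparkContext sc): the map below only records a transformation, and nothing runs until an action like count is called:

    val rdd = sc.parallelize(1 to 10)

    // No work happens here: map only adds a step to the lineage.
    val doubled = rdd.map(_ * 2)

    // The action triggers the actual computation of the whole pipeline.
    println(doubled.count())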