2 votes

I'm using Spark extensively. The core of Spark is the RDD, and as shown in the RDD paper there are limitations when it comes to streaming applications. Here is an exact quote from the RDD paper:

As discussed in the Introduction, RDDs are best suited for batch applications that apply the same operation to all elements of a dataset. In these cases, RDDs can efficiently remember each transformation as one step in a lineage graph and can recover lost partitions without having to log large amounts of data. RDDs would be less suitable for applications that make asynchronous fine-grained updates to shared state, such as a storage system for a web application or an incremental web crawler.

I don't quite understand why RDDs can't effectively manage state. How does Spark Streaming overcome these limitations?

2 You may find an indirect answer to this query (why Flink is preferred over Spark for streaming) at stackoverflow.com/questions/28082581/…. Have a look at the articles and presentations quoted in that question. - Ravindra babu

2 Answers

5 votes

I don't quite understand why RDDs can't effectively manage state.

It is not really a question of being able or not, but more one of cost. We have well-established mechanisms for handling fine-grained changes, such as write-ahead logging, but managing the logs is simply expensive. They have to be written to persistent storage and periodically merged, and they require expensive replaying in case of failure.

Compared to that, an RDD is an extremely lightweight solution. It is just a small local data structure that has to remember only its lineage (its ancestors and the transformations applied to them).
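To make the lineage idea concrete, here is a minimal sketch (assuming a local Spark setup; all names are illustrative) in which each transformation records one step in the lineage graph, which toDebugString can print:

    import org.apache.spark.{SparkConf, SparkContext}

    object LineageSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("lineage-sketch").setMaster("local[*]"))

        // Each transformation adds one step to the lineage graph;
        // no data is copied or logged.
        val numbers = sc.parallelize(1 to 1000000)
        val squares = numbers.map(n => n.toLong * n)
        val evens   = squares.filter(_ % 2 == 0)

        // Prints the lineage Spark would use to recompute a lost
        // partition from its ancestors, rather than restoring it from a log.
        println(evens.toDebugString)

        sc.stop()
      }
    }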

That does not mean it is impossible to build an at least partially stateful system on top of Spark. Take a look at the Caffe-on-Spark architecture.

How does Spark Streaming overcome these limitations?

It doesn't, or, to be more precise, it handles this problem externally, independently of the RDD abstraction. That includes using input and output operations with source-specific guarantees, and fault-tolerant storage for handling received data.
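As a rough sketch of what that looks like in practice (the checkpoint path, host, and port below are hypothetical), Spark Streaming can enable a write-ahead log for received data and checkpoint per-key state, both of which live outside the RDDs themselves:

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamingStateSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("streaming-state-sketch")
          .setMaster("local[2]")
          // Durably log received data before processing it; fault tolerance
          // comes from storage, not from the RDD abstraction.
          .set("spark.streaming.receiver.writeAheadLog.enable", "true")

        val ssc = new StreamingContext(conf, Seconds(10))
        // Checkpointing (hypothetical path) persists metadata and state.
        ssc.checkpoint("/tmp/spark-checkpoint")

        val lines = ssc.socketTextStream(
          "localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)

        // updateStateByKey carries per-key counts across batches; the state
        // survives failures via the checkpoint, again external to the RDDs.
        val counts = lines
          .flatMap(_.split(" "))
          .map(word => (word, 1))
          .updateStateByKey((batch: Seq[Int], state: Option[Int]) =>
            Some(batch.sum + state.getOrElse(0)))

        counts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }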

1 vote

It's explained elsewhere in the paper:

Existing abstractions for in-memory storage on clusters, such as distributed shared memory [24], key-value stores [25], databases, and Piccolo [27], offer an interface based on fine-grained updates to mutable state (e.g., cells in a table). With this interface, the only ways to provide fault tolerance are to replicate the data across machines or to log updates across machines. Both approaches are expensive for data-intensive workloads, as they require copying large amounts of data over the cluster network, whose bandwidth is far lower than that of RAM, and they incur substantial storage overhead.

In contrast to these systems, RDDs provide an interface based on coarse-grained transformations (e.g., map, filter and join) that apply the same operation to many data items. This allows them to efficiently provide fault tolerance by logging the transformations used to build a dataset (its lineage) rather than the actual data. If a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to recompute just that partition. Thus, lost data can be recovered, often quite quickly, without requiring costly replication.

As I interpret that, handling streaming applications would require the system to do lots of writes to individual cells, shove data across the network, perform I/O, and do other costly things. RDDs are meant to avoid all of that by primarily supporting functional-style operations that can be composed.
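For instance, here is a sketch of what a coarse-grained "update" might look like (assuming a live SparkContext named sc; the data is made up): instead of mutating individual cells, you derive a whole new dataset by applying the same join-and-map to every key:

    // State and a batch of updates, both as RDDs of (key, value) pairs.
    val state  = sc.parallelize(Seq(("a", 1), ("b", 2)))
    val deltas = sc.parallelize(Seq(("a", 10)))

    // One coarse-grained transformation over all keys, recoverable from
    // lineage, instead of many fine-grained writes to individual cells.
    val newState = state.leftOuterJoin(deltas).map {
      case (key, (old, delta)) => (key, old + delta.getOrElse(0))
    }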

This is consistent with my recollection from about 9 months ago, when I did a Spark-based MOOC on edX (sadly, I haven't touched it since): as I remember, Spark doesn't even bother to compute the results of maps on RDDs until the user actually asks for some output, and that way it saves a ton of computation.
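A quick sketch of that laziness (again assuming a SparkContext sc): the map below only records a transformation, and nothing runs until an action like count is called:

    val rdd = sc.parallelize(1 to 10)

    // No work happens here: map only adds a step to the lineage.
    val doubled = rdd.map(_ * 2)

    // The action triggers the actual computation of the whole pipeline.
    println(doubled.count())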