I'm interested in finding out how Spark implements fault tolerance. Their paper describes how they handle it for "narrow dependencies" like map, which is fairly straightforward. However, they do not state what they do if a node crashes after a wide dependency like a sort operation. The only thing I could find is this:
In contrast, in a lineage graph with wide dependencies, a single failed node might cause the loss of some partition from all the ancestors of an RDD, requiring a complete re-execution.
which is not really enough to understand what is actually happening.
After a sort, there is no way of telling where the data stored on the crashed node came from without keeping some additional information. So if a crash happens after a sort, is the entire lineage re-executed, or is there some mechanism that reduces the computational overhead? And what about other wide dependencies?
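To make the scenario concrete, here is a toy pure-Python model of lineage recomputation (an illustration of the concept only, not Spark's actual implementation; the function names and the even-split assumption are mine). It shows why recovering a lost partition behind a narrow dependency touches one parent partition, while recovering one behind a wide dependency such as a range-partitioned sort has to read every parent partition:

```python
# Toy model of lineage-based recovery (illustration only, not Spark's code).
# A "partition" is just a list; recomputing a lost output partition means
# re-running its recipe against the surviving parent partitions.

def map_partition(parent_partitions, i, fn):
    # Narrow dependency: output partition i depends only on parent
    # partition i, so recovery re-reads a single parent partition.
    return [fn(x) for x in parent_partitions[i]]

def sort_partition(parent_partitions, i, num_partitions):
    # Wide (shuffle) dependency: range-partitioned sort. Rebuilding output
    # partition i requires records from *every* parent partition, because
    # any parent may hold keys that fall into range i.
    all_records = [x for p in parent_partitions for x in p]
    all_records.sort()
    chunk = len(all_records) // num_partitions  # assumes an even split
    return all_records[i * chunk:(i + 1) * chunk]

parents = [[5, 2], [7, 0], [3, 6]]  # three input partitions

# Losing mapped partition 1 re-reads only parents[1]:
print(map_partition(parents, 1, lambda x: x * 10))   # [70, 0]

# Losing sorted partition 1 re-reads all three parent partitions:
print(sort_partition(parents, 1, num_partitions=3))  # [3, 5]
```

The question, then, is what Spark does in the second case: whether the shuffle outputs on the surviving nodes are reused, or whether the whole upstream lineage is recomputed from scratch.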