
I have a rookie Spark question. Here's the program I'm trying to execute:

dataset
 .join(anotherDataset)
 .repartition(7000)
 .flatMap(doComputation)
 .persist(StorageLevel.MEMORY_AND_DISK)
 .saveToS3()

The computation I'm trying to do (doComputation) is expensive, and because of memory limits I decided to repartition the dataset into 7000 partitions (even though I have 1200 executors, so only 1200 tasks execute at a time). After doing the computations, I try to write to S3, which mostly works fine, but a few of the tasks end up timing out and the job is retried.

1) Why does the entire job get retried even though I persist the RDD generated after doing the expensive computation?

2) I tried to coalesce after the persist, but then Spark just ignores the repartition and executes only 500 tasks:

dataset
 .join(anotherDataset)
 .repartition(7000)
 .flatMap(doComputation)
 .persist(StorageLevel.MEMORY_AND_DISK)
 .coalesce(500)
 .saveToS3()

Is there any way I can do repartition(7000), compute (on those 7000 partitions), coalesce(500), and then write to S3?


1 Answer

  1. Can you share the DAG visualization of the job? You should see some exceptions (lost executors etc.) in stderr.
  2. Coalesce is supposed to minimize data movement while reducing the number of partitions, so it does not introduce a shuffle. Without a shuffle boundary, the stage after the repartition (including the expensive flatMap) is simply merged down to 500 tasks, which is why the repartition appears to be ignored.
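One common workaround (a sketch, not from the original post) is to materialize the persisted result with an action before coalescing, so the expensive flatMap still runs with 7000 tasks and the later coalesce only merges the cached partitions for the write; alternatively, a shuffle boundary before the write (repartition(500) on a Dataset, or coalesce(500, shuffle = true) on an RDD) keeps the upstream parallelism at 7000. Note that doComputation and the s3a output path below are placeholders:

import org.apache.spark.storage.StorageLevel

// Sketch only: `dataset`, `anotherDataset`, `doComputation` and the output
// path are placeholders standing in for the original job.
val computed = dataset
  .join(anotherDataset)
  .repartition(7000)                      // expensive flatMap runs with 7000 tasks
  .flatMap(doComputation)
  .persist(StorageLevel.MEMORY_AND_DISK)

// Option 1: run an action so the 7000-task stage executes and is cached;
// the coalesce(500) afterwards only merges the cached partitions for the write.
computed.count()
computed.coalesce(500).write.parquet("s3a://bucket/output")   // hypothetical path

// Option 2: force a shuffle boundary instead, so the upstream stage keeps
// its 7000 tasks (repartition(500) on a Dataset, coalesce(500, shuffle = true) on an RDD).
// computed.repartition(500).write.parquet("s3a://bucket/output")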