I tried looking through various posts but did not find an answer. Let's say my Spark job has 1000 input partitions but I only have 8 executor cores, and the job has 2 stages. Can someone help me understand exactly how Spark processes this? If you can answer the questions below, I'd really appreciate it (a minimal sketch of the kind of job I mean follows the questions).
- Since there are only 8 executor cores, will Spark process Stage 1 of my job 8 partitions at a time?
- If so, after the first set of 8 partitions is processed, where is that data stored while Spark runs the next set of 8 partitions?
- If I don't have any wide transformations, will this cause a spill to disk?
- For a Spark job, what is the optimal file size? I mean, is Spark better at processing 1 MB files across 1000 partitions, or 10 MB files across 100 partitions?
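To make this concrete, here is a minimal sketch of the kind of two-stage job I have in mind. The paths, app name, and executor settings are all made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

object PartitionQuestion {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partition-question")
      .config("spark.executor.instances", "2") // hypothetical: 2 executors...
      .config("spark.executor.cores", "4")     // ...x 4 cores = 8 cores total
      .getOrCreate()

    // Assumed: this directory holds enough files/splits for ~1000 input partitions
    val df = spark.read.text("hdfs:///data/many-small-files/")

    // Narrow transformation: runs entirely within Stage 1
    val lengths = df.selectExpr("length(value) AS len")

    // Wide transformation: the groupBy forces a shuffle, which is the Stage 2 boundary
    val counts = lengths.groupBy("len").count()

    counts.write.mode("overwrite").parquet("hdfs:///out/len-counts")
    spark.stop()
  }
}
```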
Sorry if these questions are vague. This is not a real use case, but as I am learning Spark I am trying to understand the internal details of how the different partitions get processed.
Thank you!