We are experimenting with running Spark in our project without Hadoop and without distributed storage such as HDFS. Spark is installed on a single node with 10 cores and 16 GB RAM, and this node is not part of any cluster. Assume the Spark driver takes 2 cores and the rest are consumed by executors (2 cores each) at execution time.

If we process a big CSV file (1 GB) stored on the local disk in Spark as an RDD and repartition it into 4 partitions, will the executors process each partition in parallel? What would the executors do if we didn't repartition the RDD into 4 partitions? Do we lose the power of distributed computing and parallelism if we don't use HDFS?

1 Answer


Spark caps the maximum size of a partition at 2 GB, so you should be able to process the entire dataset with minimal partitioning and a quick processing time. You can set spark.executor.cores to 8 to utilize all your resources.
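A minimal sketch of that configuration in Scala (the app name is a placeholder; note that spark.executor.cores only takes effect under a cluster manager such as Spark standalone, while in local mode the thread count in the master URL plays that role):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("csv-processing")      // placeholder name
  .setMaster("local[8]")             // 8 worker threads on one node
  .set("spark.executor.cores", "8")  // used under a cluster manager
val sc = new SparkContext(conf)
```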

Ideally, you should set the number of partitions based on the size of your data, and you are better off making the number of partitions a multiple of the number of cores/executors.

To answer your question: setting the number of partitions to 4 in your case will likely result in each partition being sent to a separate executor. So yes, each partition will be processed in parallel.
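For example (the file path is hypothetical), loading and repartitioning might look like this:

```scala
// Load the 1 GB CSV from the local filesystem and spread it
// across 4 partitions, roughly one per executor.
val lines = sc.textFile("file:///data/big.csv")  // hypothetical path
val parts = lines.repartition(4)

println(parts.getNumPartitions)  // 4
// Each partition is processed by a separate task, so with 4 executors
// the action below runs 4 tasks in parallel.
println(parts.count())
```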

If you don't repartition, Spark will choose the number of partitions for you based on the input (for text files, the input split size) and distribute the load across the executors.
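You can inspect what Spark chose without repartitioning; the exact count depends on your Hadoop configuration, so treat the number in the comment as an assumption:

```scala
val raw = sc.textFile("file:///data/big.csv")  // hypothetical path
println(raw.getNumPartitions)  // e.g. ~32 for 1 GB with a 32 MB local split size
```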

Spark works perfectly fine without Hadoop. You might see a negligible performance drop since your files are on the local filesystem rather than on HDFS, but for a 1 GB file it really doesn't matter.