0 votes

Setup: latest AWS EMR (5.29), Spark, 1 master and 1 core node.

Step 1: I used S3 Select to parse a file and collect all the file keys to pull from S3.
Step 2: In PySpark, iterate over the keys in a loop and run the following for each key:

    spark.read.format("s3selectCSV").load(key).limit(superhighvalue).show(superhighvalue)
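For reference, a minimal runnable sketch of the two steps above; the bucket, file keys, and superhighvalue are placeholders standing in for the actual values:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3select-read").getOrCreate()

    # Placeholder keys; in the real job these come from step 1 (S3 Select).
    file_keys = [
        "s3://my-bucket/data/part-0001.csv",
        "s3://my-bucket/data/part-0002.csv",
    ]
    superhighvalue = 1000000  # cap large enough to show every row

    # Step 2: read each object with EMR's s3selectCSV data source.
    for key in file_keys:
        (spark.read
             .format("s3selectCSV")
             .load(key)
             .limit(superhighvalue)
             .show(superhighvalue))

Note that the loop runs on the driver, so each key is read as its own small Spark job, one after another.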

It took me x minutes.

When I increased the cluster to 1 master and 6 core nodes, I saw no difference in runtime; it appears the additional core nodes are not being used.
Everything else, config-wise, is the out-of-the-box default; I am not setting anything.

So, my question is: does cluster size matter when reading and inspecting (say, logging or printing) data from S3 using EMR and Spark?


2 Answers

0 votes

A few things to keep in mind:

  1. Are you sure the number of executors actually increased with the added nodes? You can set them explicitly at submit time with spark-submit --num-executors 6; more nodes does not by itself mean more executors get spun up (see the sketch after this list).
  2. What is the size of the CSV file? If it is only about 1 MB, you will not see much difference. Make sure it is at least 3-4 GB.
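For point 1, a minimal sketch of pinning the executor count from PySpark instead of relying on defaults, with dynamic allocation (which EMR enables by default) turned off so the number stays fixed:

    from pyspark.sql import SparkSession

    # spark.executor.instances is the config behind spark-submit's --num-executors.
    spark = (SparkSession.builder
             .appName("executor-count-check")
             .config("spark.dynamicAllocation.enabled", "false")
             .config("spark.executor.instances", "6")
             .getOrCreate())

    # Rough sanity check: defaultParallelism grows with total executor cores.
    print(spark.sparkContext.defaultParallelism)

You can also confirm the live executor count in the Executors tab of the Spark UI.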
0 votes

Yes, cluster size does matter. For my use case, distributing the keys with sc.parallelize(s3fileKeysList) turned out to be the key.
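To illustrate that pattern (a sketch of the general approach, not necessarily the exact code used): distribute the key list so each executor fetches its own share of objects, here with boto3 inside mapPartitions; the bucket name and keys are hypothetical:

    import boto3
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parallel-s3-read").getOrCreate()
    sc = spark.sparkContext

    BUCKET = "my-bucket"  # hypothetical bucket name
    s3fileKeysList = ["data/part-0001.csv", "data/part-0002.csv"]

    def fetch(keys):
        # One S3 client per partition, reused for every key in it.
        s3 = boto3.client("s3")
        for key in keys:
            body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
            yield key, len(body)  # inspect or log here instead of just sizing

    # One partition per key, so the reads run concurrently across executors.
    sizes = (sc.parallelize(s3fileKeysList, len(s3fileKeysList))
               .mapPartitions(fetch)
               .collect())
    print(sizes)

Unlike the original driver-side loop, the executors now pull objects in parallel, so adding core nodes actually reduces wall-clock time.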