0 votes

Setup: latest AWS EMR (5.29), Spark, 1 master and 1 core node.

Step 1: I used S3 Select to parse a file and collect all the file keys to pull from S3.
Step 2: In PySpark, iterate over the keys in a loop and run the following for each key:

    spark.read.format("s3selectCSV").load(key).limit(superhighvalue).show(superhighvalue)
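For reference, a minimal runnable sketch of the two steps above; the bucket, file keys, and superhighvalue are placeholders standing in for the actual values:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3select-read").getOrCreate()

    # Placeholder keys; in the real job these come from step 1 (S3 Select).
    file_keys = [
        "s3://my-bucket/data/part-0001.csv",
        "s3://my-bucket/data/part-0002.csv",
    ]
    superhighvalue = 1000000  # cap large enough to show every row

    # Step 2: read each object with EMR's s3selectCSV data source.
    for key in file_keys:
        (spark.read
             .format("s3selectCSV")
             .load(key)
             .limit(superhighvalue)
             .show(superhighvalue))

Note that the loop runs on the driver, so each key is read as its own small Spark job, one after another.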

It took me x minutes.

When I increased the cluster to 1 master and 6 core nodes, I saw no difference in runtime; it appears the additional core nodes are not being used.
Everything else, config-wise, is the out-of-the-box default; I am not setting anything.

So, my question is: does cluster size matter when reading and inspecting (say, logging or printing) data from S3 using EMR and Spark?


2 Answers

0 votes

A few things to keep in mind:

  1. Are you sure the number of executors actually increased with the added nodes? You can set them explicitly at submit time with spark-submit --num-executors 6; more nodes does not by itself mean more executors get spun up (see the sketch after this list).
  2. What is the size of the CSV file? If it is only about 1 MB, you will not see much difference. Make sure it is at least 3-4 GB.
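For point 1, a minimal sketch of pinning the executor count from PySpark instead of relying on defaults, with dynamic allocation (which EMR enables by default) turned off so the number stays fixed:

    from pyspark.sql import SparkSession

    # spark.executor.instances is the config behind spark-submit's --num-executors.
    spark = (SparkSession.builder
             .appName("executor-count-check")
             .config("spark.dynamicAllocation.enabled", "false")
             .config("spark.executor.instances", "6")
             .getOrCreate())

    # Rough sanity check: defaultParallelism grows with total executor cores.
    print(spark.sparkContext.defaultParallelism)

You can also confirm the live executor count in the Executors tab of the Spark UI.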
0 votes

Yes, cluster size does matter. For my use case, distributing the keys with sc.parallelize(s3fileKeysList) turned out to be the key.
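To illustrate that pattern (a sketch of the general approach, not necessarily the exact code used): distribute the key list so each executor fetches its own share of objects, here with boto3 inside mapPartitions; the bucket name and keys are hypothetical:

    import boto3
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parallel-s3-read").getOrCreate()
    sc = spark.sparkContext

    BUCKET = "my-bucket"  # hypothetical bucket name
    s3fileKeysList = ["data/part-0001.csv", "data/part-0002.csv"]

    def fetch(keys):
        # One S3 client per partition, reused for every key in it.
        s3 = boto3.client("s3")
        for key in keys:
            body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
            yield key, len(body)  # inspect or log here instead of just sizing

    # One partition per key, so the reads run concurrently across executors.
    sizes = (sc.parallelize(s3fileKeysList, len(s3fileKeysList))
               .mapPartitions(fetch)
               .collect())
    print(sizes)

Unlike the original driver-side loop, the executors now pull objects in parallel, so adding core nodes actually reduces wall-clock time.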