Setup: the latest AWS EMR (5.29) with Spark, 1 master and 1 core node.
Step 1: I used S3 Select to parse a file and collect all the file keys to pull from S3. Step 2: using PySpark, I iterate over the keys in a loop and do the following:
```
spark.read.format("s3selectCSV").load(key).limit(superhighvalue).show(superhighvalue)
```
This took x minutes.
When I increased the cluster to 1 master and 6 core nodes, I saw no difference in run time. It appears that I am not using the added core nodes.
Everything else, config-wise, is the out-of-the-box default; I am not setting anything.
So, my question is: does cluster size matter for reading and inspecting (say, logging or printing) data from S3 using EMR and Spark?