When running a Spark ETL job on EMR, does the size of the master node instance matter? Based on my understanding, the master node does not handle the processing/computation of data; it is instead responsible for scheduling tasks, communicating with the core and task nodes, and other administrative work.
Does this mean that if I have 10 TB of data to transform and write out, I can use one medium instance for the master and ten 8xlarge instances for the core nodes?
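For reference, here's a rough sketch of how I'd launch that kind of asymmetric cluster with boto3. The cluster name, release label, instance family (m5), and IAM roles are just placeholders, not my actual setup:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="spark-etl-example",       # placeholder name
    ReleaseLabel="emr-6.10.0",      # assumption: any recent EMR release
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {
                "Name": "Master",
                "InstanceRole": "MASTER",
                # Small master: only schedules and coordinates
                "InstanceType": "m4.large",
                "InstanceCount": 1,
            },
            {
                "Name": "Core",
                "InstanceRole": "CORE",
                # Large core nodes: these do the actual transform work
                "InstanceType": "m5.8xlarge",
                "InstanceCount": 10,
            },
        ],
        # Let the cluster terminate after the ETL steps finish
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",  # assumption: default EMR roles
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```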
From what I've read, most people suggest that the master node instance type should be the same as the core node instance type, which is what I currently do, and it works fine. That would be one 8xlarge for the master and ten 8xlarge instances for the core nodes.
According to the AWS docs, however, we should use an m4.large, so I'm confused about what's right.
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances-guidelines.html
The master node does not have large computational requirements. For most clusters of 50 or fewer nodes, consider using an m4.large instance. For clusters of more than 50 nodes, consider using an m4.xlarge.
m4.xlarge is sufficient. We've been running m4.xlarge in production for quite a while. - Bitswazsky