3
votes

When running a Spark ETL job on EMR, does the size of the master node instance matter? Based on my understanding, the master node does not handle processing/computation of data and is instead responsible for scheduling tasks, communicating with core and task nodes, and other administrative duties.

Does this mean if I have 10 TB of data that I need to transform and then write out, I can use 1 medium instance for master and 10 8xlarge for core nodes?

Based on my reading, most people suggest the master node instance type should be the same as the core node instance type, which is what I currently do and it works fine. This would be 1 8xlarge for the master and 10 8xlarge for the core nodes.

According to the AWS docs, we should use m4.large, so I'm confused about what's right.

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-instances-guidelines.html

The master node does not have large computational requirements. For most clusters of 50 or fewer nodes, consider using an m4.large instance. For clusters of more than 50 nodes, consider using an m4.xlarge.

Updated answer. - thebluephantom
Depends on how many apps are running over there. If it's only Spark and Hadoop, m4.xlarge is sufficient. We've been running m4.xlarge in production for quite a while. - Bitswazsky
Network speed also matters, depending on which mode you run Spark in (cluster or client), what files you load, and so on; many things are related. - Lamanus
That "most people" stuff is neither here nor there. - thebluephantom

1 Answer

1
vote

The way the question is asked is a little vague. Size does matter, in the sense of load, number of applications, and so on, so I will answer it from a slightly different perspective. That "most people ..." stuff is neither here nor there.

The way the Master was assigned in the past was, imho, a weakness of the EMR approach when I trialled it some 9 months ago for a PoC: you allocated big resources for the Workers, and by default one such instance went to the Master, which was complete overkill.

So, if you did things the standard way, you paid for a larger-than-required resource for the Master Node. There is a way to define a smaller resource for the Master, but I am on holiday and cannot find it again.

However, look at the url here and you will see that during EMR cluster configuration you can now easily define a smaller Master Node, or several Master Nodes for failover; things have moved along since I last looked: https://confusedcoders.com/data-engineering/how-to-create-emr-cluster-with-apache-spark-and-apache-zeppelin.
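As a concrete sketch, the AWS CLI lets you size the master and core instance groups independently at cluster creation. The instance types, counts, release label, and key name below are illustrative placeholders for the scenario in the question, not a recommendation:

```shell
# Hypothetical sketch: a small master with large core nodes.
# Instance types/counts and the key pair name are placeholders.
aws emr create-cluster \
  --name "spark-etl" \
  --release-label emr-6.15.0 \
  --applications Name=Spark \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m5.xlarge \
    InstanceGroupType=CORE,InstanceCount=10,InstanceType=r5.8xlarge \
  --use-default-roles \
  --ec2-attributes KeyName=my-key-pair
```

Nothing forces the master group to match the core group; each instance group is declared with its own type and count.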

See also https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-ha-launch.html for multiple such Master Nodes.
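Following the HA docs linked above, the main change (assuming EMR 5.23.0 or later, which introduced multiple master nodes) is requesting three instances in the master group. Again an illustrative sketch with placeholder values:

```shell
# Hypothetical sketch: three master nodes for high availability
# (supported from EMR 5.23.0 onward). Key pair and subnet are placeholders;
# multi-master clusters must be launched into a specific subnet.
aws emr create-cluster \
  --name "spark-etl-ha" \
  --release-label emr-6.15.0 \
  --applications Name=Spark \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=3,InstanceType=m5.xlarge \
    InstanceGroupType=CORE,InstanceCount=10,InstanceType=r5.8xlarge \
  --use-default-roles \
  --ec2-attributes KeyName=my-key-pair,SubnetId=subnet-0abc123
```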

In general the Master Node can differ in characteristics from the Workers, usually being smaller, though perhaps not in all cases. That said, the purpose of EMR would tend to point to a smaller Master Node config.