3
votes

I'm running into a problem running my application on the EMR master node. It needs to access some AWS SDK methods added in version 1.11. All the required dependencies were bundled into a fat jar, and the application works as expected on my dev box.

However, when the app is executed on the EMR master node, it fails with a NoSuchMethodError exception when calling a method added in AWS SDK 1.11+, e.g.

java.lang.NoSuchMethodError:
 com.amazonaws.services.sqs.model.SendMessageRequest.withMessageDeduplicationId(Ljava/lang/String;)Lcom/amazonaws/services/sqs/model/SendMessageRequest;
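
For context, the failing call is along these lines (a simplified, self-contained sketch with a placeholder queue URL and message values, not my actual application code):

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.SendMessageRequest;

public class SqsFifoSendExample {
    public static void main(String[] args) {
        // Placeholder FIFO queue URL (not a real queue)
        String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue.fifo";

        AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();

        SendMessageRequest request = new SendMessageRequest()
                .withQueueUrl(queueUrl)
                .withMessageBody("hello")
                .withMessageGroupId("group-1")
                // withMessageDeduplicationId arrived with FIFO queue support in SDK 1.11.x
                // and does not exist in the 1.10.x sqs classes
                .withMessageDeduplicationId("dedup-1");

        sqs.sendMessage(request);
    }
}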

I tracked it down to the classpath parameter passed to the JVM instance started by spark-submit:

-cp /usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf/:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/lib/spark/conf/:/usr/lib/spark/jars/*:/etc/hadoop/conf/

In particular, it loads /usr/share/aws/aws-java-sdk/aws-java-sdk-sqs-1.10.75.1.jar instead of version 1.11.77 from my fat jar.

Is there a way to force Spark to use the AWS SDK version I need?

1
It looks like spark.executor.userClassPathFirst set to true should allow your provided jar to override the classpath params: spark.apache.org/docs/latest/configuration.html - Dave Maple
@DaveMaple: I tried adding --conf spark.driver.userClassPathFirst=true to the spark-submit command line. My app exits almost immediately with Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: class org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback not org.apache.hadoop.security.GroupMappingServiceProvider. It looks like a version conflict to me. - Denis Makarenko
Darn, yeah. I guess we'd have to be selective then and override only the AWS SDK. Will think on this. - Dave Maple
Shading (i.e., relocating to an alternative package name) the latest version of the com.amazonaws.services.sqs package doesn't work either. It turned out that AmazonSQSClient.init() calls HandlerChainFactory.newRequestHandlerChain("/com/amazonaws/services/sqs/request.handlers"), i.e. it uses a hard-coded package name, so it can't find the relocated one. - Denis Makarenko

1 Answer

2
votes

Here is what I learned trying to troubleshoot this.

The default classpath parameter is constructed from the spark.driver.extraClassPath setting in /etc/spark/conf/spark-defaults.conf. spark.driver.extraClassPath contains a reference to the older version of the AWS SDK, which is located in /usr/share/aws/aws-java-sdk/*.
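
For reference, the relevant entry in /etc/spark/conf/spark-defaults.conf looks roughly like this (abridged and illustrative; the exact list of entries varies by EMR release):

# /etc/spark/conf/spark-defaults.conf (excerpt)
spark.driver.extraClassPath    /usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/lib/*:/etc/hadoop/conf/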

To use the newer version of the AWS SDK, I uploaded the jars to a directory I created under the home directory and specified it in the --driver-class-path spark-submit parameter:

--driver-class-path '/home/hadoop/aws/*'
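
Putting it together, the invocation looks roughly like this (the SDK jar names, main class, and application jar are illustrative; copy whichever 1.11.x SDK jars your application actually needs):

# Copy the newer SDK jars to a directory on the master node
mkdir -p /home/hadoop/aws
cp aws-java-sdk-core-1.11.77.jar aws-java-sdk-sqs-1.11.77.jar /home/hadoop/aws/

# Point the driver at that directory so the 1.11.x classes are picked up
# instead of the ones under /usr/share/aws/aws-java-sdk/*
spark-submit \
  --driver-class-path '/home/hadoop/aws/*' \
  --class com.example.MyApp \
  /home/hadoop/my-app-assembly.jar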