
I am using the AWS CLI to create an EMR cluster and add a step. My create-cluster command looks like:

aws emr create-cluster --release-label emr-5.0.0 \
  --applications Name=Spark \
  --ec2-attributes KeyName=*****,SubnetId=subnet-**** \
  --use-default-roles \
  --bootstrap-action Path=$S3_BOOTSTRAP_PATH \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large InstanceGroupType=CORE,InstanceCount=$instanceCount,InstanceType=m4.4xlarge \
  --steps Type=Spark,Name="My Application",ActionOnFailure=TERMINATE_CLUSTER,Args=[--master,yarn,--deploy-mode,client,$JAR,$inputLoc,$outputLoc] \
  --auto-terminate

$JAR is my Spark JAR, which takes two parameters: an input location and an output location.

$inputLoc is a comma-separated list of input files, e.g. s3://myBucket/input1.txt,s3://myBucket/input2.txt

However, the AWS CLI treats comma-separated values as separate step arguments, so the second input file is parsed as the second parameter and $outputLoc ends up being s3://myBucket/input2.txt.
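Concretely, with the example paths above, the Spark application receives:

args[0] = s3://myBucket/input1.txt   (intended: the whole input list)
args[1] = s3://myBucket/input2.txt   (intended: $outputLoc)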

Is there any way to escape the comma and have the CLI treat this whole argument as a single value, so that Spark can read multiple files as input?


1 Answer


It seems there is no way to escape the comma in the input file list.

After trying quite a few approaches, I finally resorted to a workaround: pass the input files with a different delimiter and convert it back in code. In my case I used % as the delimiter, and in the driver code I do:

// The input locations arrive %-delimited (e.g. s3://myBucket/input1.txt%s3://myBucket/input2.txt);
// restore the comma-separated form that Spark expects before reading.
if (inputLoc.contains("%")) {
    inputLoc = inputLoc.replaceAll("%", ",");
}
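For completeness, the --steps portion of the command would then look something like this (a sketch based on the original command, using the question's example paths and variable names):

--steps Type=Spark,Name="My Application",ActionOnFailure=TERMINATE_CLUSTER,Args=[--master,yarn,--deploy-mode,client,$JAR,s3://myBucket/input1.txt%s3://myBucket/input2.txt,$outputLoc]

Since % is not a delimiter the CLI splits on, the whole file list arrives as a single argument, and the driver snippet above restores the commas before handing the paths to Spark.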